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ABSTRACT 


This report describes the Cluster Compression Algorithm (CCA), which 
was developed to reduce costs associated with transmitting, storing, dis- 
tributing, and interpreting Landsat multispectral image data. The CCA is a 
preprocessxng algorithm that uses feature extraction and data compression 
to more efficiently represent the information in the image data. The for- 
mat of the preprocessed data enables simple look-up table decoding and 
direct use of the extracted features to reduce user computation for either 
image reconstruction, or computer interpretation of the image data. 
Basically, the CCA uses spatially local clustering to extract features 
from the image data to describe spectral characteristics of the data set. 

In addition, the features may be used to form a sequence of scalar numbers 
that define each picture element in terms of the cluster features. This 
sequence, called the feature map, is then efficiently represented by 
using source encoding concepts. Various forms of the CCA are defined 
and experimental results are presented to show trade-offs and character- 
istics of the various implementations. Examples are provided that 
demonstrate the application of the cluster compression concept to multi- 
spectral images from Landsat and other sources. 
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CHAPTER I 
INTRODUCTION 

A. Objectives of Data Redundancy Reduction 

Early removal of redundant data is iuq>ortant in reducing costs and. 
time delays in the link between the sensor data and the users' results, 
or alternately early removal of redundancy can be used'to obtain higher 
performance. Data redundancy reduction, or information extraction is 
especially important in data systems involving high rate remote data 
sensors which require related storage, communication, archiving, distri- 
bution and Interpretation subsystems. Imaging data systems are 
important examples for application of early information extraction. 

Data rate reductions, or compression ratios in image approximation 
applications often approach and may exceed an order of magnitude. In 
applications of computer classification of the image data, the rate 
reductions may be many orders of magnitude. . An ever increasing trend 
toward automatic Interpretation emphasizes the need for data redundancy 
reduction jointly suited to conq>uter classification and approximation of 
Images. 

The objective of this work is to develop a joint feature 
extraction/data compression technique for removal of redundant data in 
image approximation and computer classification applications. A tech- 
nique is described which jointly applies clustering and source encoding 
concepts to obtain data compression. An algorithm, called the Cluster 
Compression Algorithm, for implementing this concept is investigated 
with emphasis given to practical data system considerations. 
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B. Practical Data System Considerations 

The emphasis in 'this work' is on the development of a practical 
information extraction technique which -benefits the entire data system. 
The performance of a data compression technique must be judged by more 
than its compression ratio. Other system aspects for which the Impact 
and interaction of data compression should be assessed are data storage, 
transmission and channel errors, distribution and archiving, and user 
interpretation, 

A block diagram of a practical data system is shown in Fig. 1.1. 

All of the subsystem elements shown in Fig. 1.1 are not present in 
every application. For example, in many systems the sensor- located 
storage, channel coding/ decoding, archiving and distribution are 
unnecessary. The sensor design has an important impact on data redun- 
dancy. The sensor acquires data which may range from inadequate to 
more than adequate to provide the information desired, but in either 
situation there is usually considerable extraneous measurement data. 
Flexible instrument control strategies can be used to reduce data 
redundancy, for example, through control of measurement rate and 
resolution, dynamic range scaling, and simple on-off functions. 

Channel errors have a more serious impact on data after it has been 
compressed. Ideally one would like to jointly solve the problems of 
source encoding to remove redundancy and channel encoding for error 
protection. From a practical standpoint, however, the present 
approaches usually involve using Increased transmitter power to lower 
the error rate, and the use of periodic references in the compressed 
data for source decoding restarts to limit the source decoding error 
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propagation due to a channel error* An alternate approach derived In 
[l] uses channel encoding to group channel errors into long bursts. 

Since cotipressed data is degraded nearly equally by a single error and. 
a burst of errors, the lase of channel coding to group errors permits 
using a higher average error rate. 

In high rate data systems involving large scale archiving, distri- 
bution and interpretation, redundant data can cause serious cost and 
time delay problems. However, data compression may not benefit these 
subsystem functions if complicated decoding is required before selec- 
tive access of the data for distribution ^d interpretation is possible. 
Benefits would result if the data compression could be obtained through 
extraction of only essential features which can be selectively accessed 

and interpreted directly without extraneous decoding. 

* 

The preceding discussed the interactions of practical data system 
functions which need to be considered when incorporating data compres- 
sion into a -system. Now consider the desirable characteristics of the 
data compressor itself. In general, the process of redxindancy removal 
might occur in various stages, at the sensor and later in the communi- 
cation link. In many cases a data system requires data compression 
for widely different applications. Even for a dedicated application, 
the user definition of adequate quality is an \incertainty which suggests 
a need for capability to change. Therefore, a good data compressor 
emphasizes flexibility, allowing a high performance rate vs. qtiality 
trade-off over wide ranges of compression ratios and applications. In 
order to aid the user in supervising the redundancy removal, a data 
compressor should use features which provide easy understanding of the 
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relationship between the feature quality and the resulting quality of 
image approximation, or computer classification. Basically, the data 
compressor should allow the user to attempt reduction of the data 
directly to the desired Information. 

C. Landsat As a Specific Practical Example 

In view of the emphasis in this work on practical data compression 
an existing data system will be used throughout to exemplify the results 
of investigating the Cluster Compression Algorithm. A particularly 
demanding, but practical data system for applying data redundancy 
reduction is NASA's Landsat, formerly known as the Earth Resources 
Technology Satellite (ERTS) . Landsat I is in a sun synchronous polar 
orbit at an altitude of 915 km and orbits the Earth every 103 minutes. 

A block diagram of the major elements in the Landsat data system is shown 
in Fig, 1.2, and additional details of these elements can be found in [2], 
There are other sensors on-board Landsat I, but attention here is 
focused on data-management for the high rate Multispectral Scanner (MSS) 
imaging sensor. The MSS sensor continuously scans 6 lines in 4 spectral 
bands between -0.5 to 1.1 ym. The instantaneous field of view (IFV) is 
79 m on a side and the cross track motion of the IFV is 56 m per sample. 
Every picture element is a vector with four spectral components, each 
digitized to one of 64 levels (6 bits) of radiation intensity. A new 
picture element is scanned approximately every 1.6 ys producing a 15 
Mbps data rate. The resulting PCM data is temporarily stored on a Wide 
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Band Video Tape Recorder and/or used to FSK modulate a carrier tor 

”5 

transmission to Earth with an average error rate less than 10 

The NASA Data Processing Facility (NDPF) at Goddard Space Flight 
Center in Greenbelt, Maryland segments the continuous MSS data into 
185 Km square scenes and provides standard output products of black and 
white, and color negatives, positives and prints in 70 mm and 241.3 mm 
enlargements as well as computer compatible digital tapes. The NDPF 
also provides radiometric, geometric and other corrections to the ^SS 
data, but it does not perform any redundancy reduction, since the satel- 
lite is presently a research vehicle. The NDPF products are archived 

J 

and distributed to many users over the entire world. 

User analysis techniques include film interpretation, computer 
classification, predictive modeling and combinations of these methods. 

By analyzing the sampled electromagnetic energy in four spectral windows 
the investigators can extract information from the image data which is 
useful in water and crop management, urban analysis, pollution detec- 
tion, weather and flood prediction, mapping, and many other applica- 
tions. Discussions of applications of Landsat MSS data are given in 
[Sj and [4]. 

Presently there is no data compression on-board the satellite, 
because Landsat has been used principally as a research vehicle to deter- 
mine the beneficial applications of remote sensing. The future outlook 
is for operational satellites which gather data for specific purposes 
determined to be useful through previous research. These operational 
satellites will increase their capabilities by incorporating higher 


7 



77-43 


spatial and spectral resolution. A reasonable improvement to 10 m 

spatial resolution and 10 band spectral resolution, however, results in 

9 

an increase in data rate to over 2 x 10 bps, justifying the concern 
over the often mentioned data explosion. 

For the Landsat data system one must consider more than just the 
use of on-board data compression to reduce the channel transmission 
rate requirements. An important part of the Landsat data management 
problem is the ground data handling (e.g. , data storage, retrieval, 
distribution and intepretation) . Furthermore, the Landsat data 
system is an example which involves both user needs of visual image 
interpretation and computer classification. Thus this data system 
serves as a good example application for a joint feature extraction/ 
data compression technique. 

D. Prior Efforts in Feature Extraction and Data Compression 

Historically, the emphasis in data compression has been in develop- 
ing techniques to approximate the raw data with fewer bits required for 
transmission. Techniques such as DPCM Cs], Transform methods [6] - 
[s], and combinations of these techniques combined also with variable- 
length entropy encoding [9] have received much attention in the litera- 
ture. Such approaches are suitable, for example, in applications 
concerned with subjective image quality, but they are less suitable in 
applications involving feature extraction or classification of the 
data. For example, a feature of interest in the image may be difficult 
to emphasize in terms of differences used in DPCM, Hadamard coeffi- 
cients in Hadamard Transform techniques, or in terms of the principal 
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components resulting from the Karhunen-Loeve Transform. Even more 
so for classification applications, it would generally be difficult 
to tailor the type of data degradation introduced so as to tend to 
preserve classification accuracy. This decreases the efficiency in 
redundancy reduction and complicates the user requirement input inter- 
face, Furthermore, these approaches generally require decompression of 
the conq>ressed data to its expanded form before selective access and 
interpretation is possible. 

In those applications where features or classification are the 
desired end product, it is reasonable to consider the use of pattern 
recognition theory. Pattern recognition and the subsets of feature 
extraction and classification are discussed with extensive references in 
the surveys of [lO] and [llj. 

A very common classification approach is supervised statistical 
classification with a parametric model of multivariate Gaussian [l2]. 
Classification -for Landsat data is usually done on a per picture element 
basis because of the large volume of data and extra computation required 
when texture is used in classification. A discussion and analysis of 
using texture in classification is given in [13]. 

Much work in feature extraction has centered around finding 
measures of distance between distribution functions, such that the 
distance measures are well correlated with the probability- of error 
associated with the distribution functions. These measures of dis- 
tance are easier to calculate than the probability of error and, 
therefore, reduce the computation required to determine the performance 
of selected features in the sense of classification accuracy. A 
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discussion of such distance measures and their uses are given in 
references [l3, pp. 27-47] and [l4]. 

Other often used feature extraction and classification is based on 
the nonsupervis'ed techniques' of clustering [15], [l6]. Clustering is a 
technique for extracting statistical features from a data set by group- 
ing data vectors of similar characteristics* The clustering of data 
into groups can be used to directly provide nonsupervised classifica- 
tion, or to provide statistical features for groups of data from which 
further classification can be conducted. The application of clustering 
for feature extraction and nonsupervised classification to remote sens- 
ing applications such as Landsat is discussed in [l3], El7], and [l8]. 

Feature extraction and classification as described in the above 

m 

references are primarily considered interpretation techniques to be per- 
formed by the user. However, if the feature extraction or classification 
is performed at the source it is clearly providing data compression. For 
example, if every picture element was classified to one of eight classes 
at the data source, then a total of three bits could be used instead of 
the six bits per each band originally used to define the spectral signa- 
ture of each picture element. This type of data compression also 
directly provides the user with desired results, eliminating the step of 
first decoding the compressed data to obtain an approximation of the raw 
data before interpreting. One obvious problem with using feature 
extraction and classification at the data source is the large computa- 
tional complexity associated with these techniques. Another practical 
problem, often overlooked, is the difficulty in providing supervision 
to the classifier located with a remote sensor. 
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The prior research basically consists of the following! 1) data 
compression techniques suitable for image approximation* but not well 
suited for feature extraction or classification; and 2) feature extrac- 
tion and classification techniques not well suited for image approxi- 
mation* or for practical Implementation at remote sensors. This 
dissertation develops a technique which uses feature extraction to 
obtain data con^ression. Modifications to the feature extraction 
technique of clustering are Investigated which make it useful for 
image approximation as well as practical to implement. In addition, 
the data compression technique of entropy coding is combined with 
feature extraction to further increase performance. This approach 
of joint feature extraction/ data compression is called the Cluster 
Compression Algorithm and some of the initial investigation results 
are contained in [l9] and [203. 

E. A Joint Feature Extraction/Data Compression Model 

Assume the data source is the multlspectral image source modeled 
by the continuous random process s(y^,y 2 »w) which is the electro- 
magnetic energy at wavelength w for spatial coordinates y^ and ■^ 2 ' 

The measurement vector elements of a digitized d-band multlspectral 
image are then represented by the vector X(Yj^*Y 2 ) obtained from 
s(yj^*y 2 »w) by discretizing the variables y^, y 2 and w to give 

X(Yi*Y2)»[s(Yj^*Y2,W^) ,s(Y^,Y 2*W2) , . . . ,s(Y3^*Y2*W^) ] . (1.1) 

Figure 1.3 shows a model for the general concept of joint feature 
extraction/ data compression. The Cluster Compression Algorithm defined 
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Fig. 1.3* A Joint Feature Extraction/ Data Compression Model. 


in the next chapter is a specific form of this more general model. Let 


be a sequence of n measurement vectors obtained from a subset 
of multispectral image data. The entire sequence analyzed 

to extract features for the sequence These features 

could also be considered as primitives, or as a basis for approximating 
the measurement vectors. The sequence of m features provides a 

description of the characteristics pertaining jointly to the entire 

^ ■■ 

sequence of measurement vectors {X and in some applications may be 

the only output required. In other applications each measurement 

vector needs to be approximated in terms of by the sequence 

Ic o 

of scalar numbers {y which assigns to each measurement vector an 

approximation in terms of one of the primitives. Alternately, the 
concept of Fuzzy Set Theory [2l] might be used to allow each measure- 
ment vector to be described by a weighted mixture of more than one of 
the features in the sequence ,. The scalar sequence {y^}f , 

constitutes a spatial map of the measurement vectors in terms of the 
primitives, and is called the feature map. Spatial features can be 
extracted from {y source encoding can be used to efficiently 

represent the spatial characteristics of through the encoded 

feature map denoted by C^. For each sequence of measurement 
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vectors the model basically uses feature extraction concepts to 
determine a set of features or primitives, and then the model uses data 
compression techniques to efficiently represent the spatial features in 
terms of the primitives. 

The Cluster Compression Algorithm uses clustering to extract 
multidimensional primitives from subsets of spatially contiguous measure- 
ments in the image data and then applies entropy encoding to the 
feature map. Another independently developed data compression technique 
related to the model in Fig. 1.3 is the Blob algorithm [22] . This 
algorithm is basically a boundary finding algorithm that guarantees 
closed botmdaries. A blob is defined as a connected set of picture 
elements all of which have some common characteristic. In the Blob 

algorithm the primitives are defined by the boundary finding algorithm 

\ 

and consist of statistical descriptions of the measurement vectors 
contained within the various blobs. Spatial definition can then be 
provided by encoding a sequence which defines the spatial boundaries 
of the blobs. This algorithm is most useful as a data compressor when 
the image consists of well defined boundaries enclosing picture elements 
of nearly uniform characteristics, (e.g., agricultural fields). The 
Cluster Compression Algorithm is different in that it extracts primi- 
tives and obtains significant con 5 >ression Independent of the spatial 
characteristics of the data. Thus the cluster compression technique 
can efficiently compress images whether or not they contain uniform 
areas with well defined boundaries. In addition, if the image does 
consist of uniform areas with well defined boundaries, such as fields, 
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the entropy - encoding of the feature map results in efficient repre- 
sentation of these spatial boundaries. 

P, Organization of the Dissertation 

In Chapter II the use of clustering as a feature extractor is dis- 
cussed and several clustering algorithms are defined for the CCA, In 
Chapter III the basic Uncoded CCA is defined and examples are included 
of its xjse in image approximation and automatic image classification. 
Chapter IV investigates entropy bounds for a class of entropy 
coders and then defines an entropy coder for use in a Coded CCA 
configuration. Simulations are used to compare the performance of the 
entropy coder with the performance bounds. 

Alternate clustering forms of the CCA are defined in Chapter V. 
These include the Adaptive CCA and the Cascaded CCA. The performance 
of the Uncoded, Co'ded, Adaptive and Cascaded CCA are investigated in 
Chapter VI. Computer simulation results are used to compare the 
performance of-various CCA options in terms of percent mean square 
error, subjective image appearance, and classification accuracy as a 
function of data rate. Conclusions and recommendations for further 
research are contained in Chapter VII. 
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CHAPTER II 

CLUSTERING AS A FEATURE EXTRACTOR IN DATA COIIPRESSION 

A. Clustering for Spectral Intensity Feature Extraction 

Now consider the selection of features for use in the Cluster 
Conqiression Algorithm (CCA) . Let ® sequence of 

n d~dimensional spectral intensity measurements for a spatially local 
region of a multispectral image. These vectors generally tend to form 
groups in multispectral intensity space. A typical plot of for 

two bands in spectral intensity space is shown in Fig, 2,XCa). An 
intuitively natural set of features to use for representing {IC 
Fig. 2.1(a) are the descriptions for groups of measurement vectors, for 
example, the groups depicted in Figs. 2.1Cb) - 2.1(d). Each feature 
then consists of whatever set of parameters are used to describe the 
source distribution for the group of measurement vectors. The mean and 
covariance are an obvious example of a group description, or feature. 
The number of -features extracted from any sequence of measurement 
vectors woiJ.d depend on how accurately must be represented, and 

correspondingly what data rate is acceptable. An example of how two, 
four, or eight features might be chosen is shown in Figs, 2.1(b) - 
2.1(d). 

Clustering is a means for automatically grouping multidimensional 
data [ll], CiS], Cl6]. The features resulting from clustering a set of 
measurement vectors are typically a subset of the following: cluster 

means, cluster variances per band, cluster covariances, number of 
vectors in each cluster, set of intercluster distances, etc. Anything 
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BAND 2 






Fig. 2.1. Typical display in two band spectral intensity space for measurement vectors 
and for groups of measurement vectors used as features, (a) Measurement 
vectors '^^wo groups, (c) Four groups. (d) Eight groups. 
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contributing to the description of the cluster can be considered a 
feature, even the set of vectors themselves. Clustering is vised at the 
Laboratory for Applications of Remote Sensing (LARS) at Purdue Univer- 
sity to aid ground analysis of Landsat MSS image data [12]. LARS mainly 
uses clustering to aid in selecting training sets from the image data 
for use in supervised classification, and also to verify the unimodal 
assumptions for the data. Clustering has also been used as a nonsuper- 
vised classifier of Landsat data in studies by Johnson Space Flight 
Center [17]. The self-scaling and adaptive characteristics of cluster- 
ing together with the convenient supervision requirements suggest it 
might also be useful as an on-board feature extractor for data com- 
pression purposes. 

First consider the efficient manner in which cluster features used 
by a data compressor can preserve classification accuracy. Let 
Fig. 2.2(a) represent sample conditional distributions for class A and 
class B whi ch ar e very separable, and let Fig. 2.2(b) represent two 
classes which are less separable. The dashed line represents a good 
classification boundary for the sampled data sets. The user generally 



(») (b) 


Fig. 2.2 Example of Two Classes in Two Dimensional Spectral 
Space, (a) Separable classes (b) Nonseparable 
classes requiring more clusters. 
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knows the separability of the classes, but the distributions and 
boundaries move around in spectral intensity space due to disturbance 
effects on the measurement process. Clustering will be relatively 
insensitive to this shifting since it inherently tracks the data. The 
user can supervise the clustering so as to conserve classification 
accuracy by simply specifying control parameters such as how many clus- 
ters, or maximum variance per band allowed for a cluster, etc. For 
example, assume the data in Fig. 2.2 is compressed by approximating each 
data vector by one of the cluster means. Then in Fig. 2.2(a) two 
clusters are adequate while in Fig. 2.2(b) four or more might be 
required for preserving classification accuracy. If other than Class A 
and B data is present, then a specification of maximum cluster variance 
per band could be used in -determining the number of clusters to use, and 
the allowed cluster variance would be larger in Fig.* 2.2(a) than in 
Fig. 2.2(b). There exists a relatively close relationship between clus- 
ter features and data features typically used in supervised 
classification . 

Clustering can also be used by a data compressor to provide a very 
sophisticated adaptive multidimensional quantizer for obtaining image 
approximation. Image reconstruction can consist simply of using the 
cluster mean feature for every vector in that cluster. Again simple 
user Inputs such as the number of clusters, variance per cluster, etc., 
are easily related to the quality of image approximation. Thus, cluster 
features appear to comprise an efficient and easily interpreted set of 
features for data compression in both classification and image approxi- 
mation applications. However, we must yet consider the modifications 
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necessary to the usual clustering approaches in order to make more 
practical the use of clustering for data compression. 

B. Clustering Approaches 

1. Basic Clustering Algorithm 

The CCA could use any of the many approaches for grouping 
data [ll], [15], [16], [23] - [25]. The clustering algorithms investi- 
gated in this dissertation are all derived from the basic iterative 
approach to clustering referred to in this work as the Basic Clustering 
Algorithm (BCA) and shown in Fig. 2.3. This approach is chosen because 
it requires simple repetitive computations, and can be structured in a 
highly parallel manner for very high data rates. 

i 

Consider the assignment step shown in Fig. 2.3. Let X , i=l, 

2, . . . , n be the ith d-dimensional data vector which is to be 
assigned to one of m cluster centers, or means, j=l* 2, ...» m. 
The components of these vectors are defined by X^ = > 

, and . . . , assignment of data vectors to 

clusters depends on the definition of distance between a data vector and 
a cluster center. In general, the assignment of vectors to clusters 
could involve a distance measure dependent on more than the cluster 
center [24], [25], For example, the distance measure might involve the 
sample covariance of the cluster to account for the possibly nonsym- 
metric distribution of its members. Furthermore, the assignment of 
vectors to clusters could be probabilistic as in Fuzzy Set Theory [2l], 
where the degree of assignment of a vector to each cluster would depend 
on the relative distance between the vector and each cluster. However, 
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DONE 


Figure 2.3. The Basic Clustering Algorithm. 


since practical considerations are being emphasized only cluster means 
and Fuzzyless assignments are considered here. In addition, only the 
most often used and easier to calculate distance measures are investl* 
gated. The most common distance measure is the Euclidean distance. Let 
the Euclidean distance between data vector and cluster center ^ 
be defined by 
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( 2 , 1 ) 


Another distance measure which requires even less calculation is the 
absolute value distance, which is defined by 

d 

k=l 

Thus'assinne the assignment task of the BCA consists of assigning each 
data vector to the closest cluster, where the distance is measured by 
(2.1) or (2.2). 

After assigning every data vector to a cluster, there is a check to 
see if the iterating procedure should be terminated. In computer 
interpretation applications this check consists of terminating when the 
clustering has .converged, or equivalently, when no data* vectors have 
changed cluster assignment. For practical data compression the iterat- 
ing should terminate if either the clustering converges, or if a chosen 
maximum number ISTOP of iterations are reached. The impact on cluster- 
ing quality due to limiting the nuniber of iterations is investigated in 
Chapter VI. 

A change in cluster assignment for one or more data vectors usually 
results in a change in the cluster centers. Thus as shown in Fig. 2.3, 
the cluster centers need to be recalculated. Let represent the set 
of all data vectors belonging to the jth cluster, and the number of 
data vectors in (7^ . Then the cluster centers are simply recalculated 
according to 


4 “ 4 


( 2 . 2 ) 
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n 


I 

{i:X^e 


(2.3) 


where j = If 2, . . . , m and k = !» 2, . . . , d. 

The only user supervision required for the BCA is the number of 
clusters desired (ISTOF is assumed to be specified once at time of 
implementation only) . The number of clusters obtained is generally the 
same as the number of initial cluster centers upon entering the 
algorithm. However, in some cases a cluster may die during the iteration 
process due to no data vectors being assigned to it. The BCA is 
somewhat insensitive to the choice of intitial cluster centers. 
Basically, the initial cluster centers should be well scattered through- 
out the set of all data vectors. Let jC = (Cj^, C 2 f ...» c^) and 
V = (Vj^., V 2 » . . . , v^) be the mean and variance respectively of all 

the data vectors to be clustered. Then one method of choosing 

i "1 1 *1 

m Initial cluster centers, “ ^*^11* ^12’ • • • » defined by 

where j = 1, 2, . . . , m and m > 1. Equation (2.4) places the 
m initial cluster centers evenly spaced on the diagonal of positive 
correlation through the hyperrectangle enclosing plus and minus one 
standard deviation about the mean in each band. 

The BCA of Fig. 2.3 heavily emphasizes simplicity for practical 
implementation. This is particularly true if an absolute value distance 
measure is used and if there is a limit placed on the number of itera- 
tions allowed. The BCA is only the core of more sophisticated 
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clustering algorithms used In data Interpretation. However^ it is 
shown in Chapter III that very simple clustering is useful for data 
compression purposes. Observe that although only the cluster centers 
were used in the above algorithm, other cluster features such as 
variance, covariance, etc., could be obtained if desired by calculating 
these features once upon terminating the BCA. 

2. Adaptive Clustering Algorithm 

The BCA generates a predetermined number of features per data set, 
which would correspond to a rate controlled mode for a data compressor. 
However, for some applications it is desirable to have the ntanber of 
clusters used per data set be variable and adaptively determined to meet 
certain quality requirements. There are many ways to modify the BCA to 
make it adaptive. For example, clusters might be split, combined, or 
deleted during the iterative clustering based on ifltracluster and inter- 
cluster distance measures. A widely used algorithm, ISODATA, with these 
adaptive traits was originated by Ball and Hall £26]. Modifications of 
- ISODATA for ground data evaluation of Landsat multispectral data pro- 
duced ISOCLS [17], [is], and HSCLAS [27]. An Adaptive Clustering 
Algorithm (ACA) is defined in Fig. 2.4 for use in investigating adaptive 
clustering for data compression. The ACA results from simplifications 
of the ISOCLS approach. Observe that the ACA consists of the BCA as a 
core with the added capability of modifying the number of clusters 
between iterations. However, changes in the number of clusters is 
inhibited in the last two iterations, where only the BCA core is used. 
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USER SUPERVISION 

MINIMUM NUMBER OF COISTERS LL 
MAXIMUM NUMBER OF CLUSTERS UL 
DELETE THRESHOLD T, 

SPUT THRESHOLD T^ 
COMBINE'THRESHOlD'Tg’ ' 
NUMBER OF INITIAL CLUSTERS 


CALCUUTE INITIAL CLUSTER aNTERS 


INITIALIZE ITERATION COUNTER 
NITER* 0 




TO NEAREST 
CLUSTER CENTER 


CONVERGE 
OR ISTOP 
ITERATIONS 


CALCULATE UVN 
CLUSTER CENTERS 


NITER* NITER + 1 


<ISTOP-2 


DELETE CLUSTER 
IF n {<Td 


= ISTOP-2 


KISTOP-2 


COMBINE 


CHECK EACH CLUSTER PAIR 
SMALLEST Oj. FIRST 

COMBINE IF 
m>LL 


CHECK EACH CLUSTER 
LARGER Oj FIRST 

SPUTIF 

m<UL 


n,X2T, + 1> 
’ "ANO« 
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in order to tend toward clustering convergence before reaching the maKi- 
mum number of iterations. 

Xhe user supervision of the ACA is still simple with only the addi- 
tional inputs of limits and thresholds. The limits LL and UL specify 
the lower limit and upper limit respectively for the number of clusters. 
From a practical standpoint » these limits provide important control in 
trading-off the degree of adaptability vs. the system complexity. The 
thresholds for deleting, splitting, and combining are defined respec- 
tively by T^, T^, and T^. Deleting consists of just eliminating a clus- 
ter center for which the number of vectors assigned to it are less than 
Tj. This corresponds to a minimum density requirement in order for a 
type of data to be important. Defining a split threshold corresponds to 
specifying an intracluster distance value above which the cluster is 
likely to consist of more than one class of interest. Each cluster can 
be split at most once per splitting iteration. A necessary condition 
for splitting the j-th cluster is that the intracluster distance D^ be 

greater than the splitting threshold T . However, if the number of 

s - 

elements in the j-th cluster- is not much more than T^, then splitting 
would likely result in two clusters which would both be deleted. There- 
fore, an additional requirement for splitting the j-th cluster is that 
the number of elements of the cluster be ^(2T^ +"1). The final addi- 
tional condition necessary for splitting is that the present number of 
clusters be less than the maximum number allowed, UL. In summary, if 
the j-th cluster has not been split in this iteration, the necessary 
conditions for splitting the j-th cluster are 
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D. > T 
3 s 


tij > (2T^ + 1) 


(2.5) 


m < UL 

where is the number of elements in ^ and m is the present number of 

clusters. An exception to these conditions occurs if m < LLj in which 

case the splitting occurs even if D. T . The number of clusters can 

3 s 

be reduced below LL by the deleting step. 

A discussion of various intracluster distance measures can be found 
in [24] and [25]. A very simple distance measure based on the variance 
per band is used in the ACA. Let ^ * * * * 

vector of standard deviations for each band of the j-th cluster. Then 
the intracluster distance for the j-th clt^ter is defined by 


D- ® Max a? 
i u k 


( 2 . 6 ) 


where k *» 1, 2, . . . , d and j =* 1, 2, . 


m. 


Let k' be the value of k which maximizes o^. Then if splitting 

conditions in (2,5) are satisfied, the j-th cluster center is split 

11 12 

into two new cluster centers, and ^ defined by 




C2» • 


- at 


I J • • * > 


4) 


and 


(2.7) 


4’ • 


J 

'k' 


, C^f + CTj^, , . 


• > 
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This simply corresponds to splitting into two cluster centers placed 
a plus and minus standard deviation from the original mean and along the 
axis of greatest variance. 

The combining of clusters in the ACA involves first finding the 
intercluster distance measure for all cluster pairs. Then cluster 
pairs are combined in the order of minimum distance first, provided the 
distance is less than T^ and there are presently more than LL clusters. 
However, clusters already combined on this algorithm pass are not com- 
bined further. Assume the i-th and j-th cluster are to be combined into 
an i’-th cluster. Then the i’-th cluster center is simply defined by a 
weighted average, or 


, n C + n.C*^ 

0 ^ . 3 - 


n. + n. 
i J 


Usual characteristics of are 

“ij i “ 


D - . = D.. 
±3 Ji 


( 2 . 8 ) 


( 2 . 9 ) 


for i and j any integer from 1 through m. Thus the number of unique 
intercluster distances, N(m), to calculate for m clusters is given by 


N(m) 


m(m - 1) 
2 


( 2 . 10 ) 


Tlie intercluster distance measure used in the ACA is defined by 

v2 


'ij 


d 

E 

k=i 


k - 4 )‘ 


44 


( 2 . 11 ) 
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A more simple measure of distance between clusters would result from 

using the cluster means, for example D,. = D (C^, C^). The distance 

ij e 

measure in (2,11) is better in- that it weights' the" distance between 
cluster means in each band relative to the spread of the clusters in 
that band. Typically the intercluster distance measures used in data 
interpretation are more sophisticated than the one in (2.11), For 
example, the intercluster distance measure used in NSCLAS is the Swain- 
Fu distance, which uses all the covariance characteristics of both 
clusters in computing the distance [l2]. Many of these general inter- 
cluster distances are discussed in These other measures could 

often give a more accurate measure of intercluster distance, but they 
would also require considerably more computation. The emphasis in this 
preliminary study of adaptive clustering for data compression use is on 
clustering approaches amenable to high data rate implementation. From 
this standpoint it is desirable to simplify rather than increase 
complexity relative to the measure in (2.11). Furthermore, initial 
investigations suggest that more sophisticated clustering results in 
only slight benefits for data compression purposes. The BCA and ACA 
defined in this section provide examples of practical clustering 
approaches which can be used for data compression. The next section 
discusses other considerations in making clustering more practical for 
data compression. 

C, Clustering of Spatially Local Sources 

Clustering for data interpretation generally involves a clustering 
of an image into twenty or fewer groups. However, if a Cluster 
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Compression Algorithm is to be useful for image approximation, it must 
be able to cluster an image (e.g., 256 ^ 256 picture elements) into 
perhaps thousands of clusters to adequately preserve image quality. 

One cannot simply obtain thousands of clusters per image as might 
be needed in image approximation, by performing the clustering jointly 
for thousands of clusters. The large number of clusters would Increase 
the computation time beyond practical limits. However, one can cluster 
the vectors in very small array sizes (e.g., 8x8). Then even though 
there is a small average number of clusters per area, say 4 per 8x8, 
there would still be a large number (4096) of clusters per 256 x 256. 
Other advantages of clustering on such a small scale are discussed 
below. 

Define P„(X) as a global sample distribution function in spectral 

o 

intensity space obtained by sampling measurement vectors over some 
spatially large source g, for example, 2340 x 3300 picture elements as 
in a Landsat image. expressed as a mixture of local sample 

distribution functions obtained from sampling measurement vectors 

within small spatially local sources, (e.g., 8x8) giving 



( 2 . 12 ) 


where N« and N are the number of picture elements in i and g respec- 
* S 

tively. Sources g and St, are depicted in Fig. 2.5(a) and typical distri- 
butions for Pg(X) and are shown in Fig. 2.5(b). Observe that it 

is reasonable to expect a low average number of classes of interest per 
local source relative to the total number of classes within the global 
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Fig. 2.5. Global vs. Local Sources, (a) Spatially, 
(b) Distributions, (c) Conditional 
Distributions . 


source. Thus, clustering at the local source level is initially easier 
to implement because there are fewer classes and fewer vectors to 
cluster simultaneously. 

1 2 

Consider some local source with classes V and V . Typical 
Pg(X/vS, P^^(X/V^) and P^*(X/V^) are shown in Fig. 2.5(c), 
The larger spread in each global conditional distribution is due to 
averaging over many local conditional distributions, each with corre- 
sponding various perturbations due to class variance, moisture, atmos- 
phere, etc. Therefore, at the local scale the vectors of a given class 
tend to form tighter clusters which are more separable. 

The greater separability of classes within a local source would 
tend to permit the use of simpler clustering algorithms while still 
preserving separability. Thus clustering at a spatially local source 
level is not only important for practical implementation reasons, but 
it is also an advantage from the standpoint of clustering quality. 
Experimental results relating local source size and clustering quality 
are contained in Chapter VI. 
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CHAPTER III 

UNCODED CLUSTER COMPRESSION ALGORITHM 


A. Spectral Features and Spectral Data Rate 

The most basic form of the Cluster Compression Algorithm is the 
Uncoded CCA as shown in Fig, 3.1. Let ^ of n measure- 

ment vectors from one local source out of a sequence of spatially local 
sources* The local sources might consist of sequences of multispectral 
arrays obtained most conveniently from a single scan swath from the 
image sensor. Each local source is then clustered into m clusters to 
extract spectral intensity features, where m is assumed constant for an 
image and represents the only user supervision input. In this study of 
the Uncoded CCA the clustering used is the Basic Clustering Algorithm 
defined in Chapter II. The features which can be extracted from 

by clustering would typically be a subset of the following: 
cluster means, cluster variances per band. 



GLOBAL IMAGE 


Fig. 3,1. A Block Diagram of the Uncoded CCA. 
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cluster covariances, number of vectors in each cluster, set of inter- 
cluster distances, etc. These types of cluster parameters comprise a 

compact grouping description of {X^}? Let_{]p^}^ be the set of m 

- . > - - - - — 

cltister features which describe the sequence of n d-dimensional measure- 
ment vectors. Assume each requires f bits of quantization. For 

■j* 

example, in image approximation might be the j*-th cluster mean with 

i* 

f equal to 6»d bits, or in classification uses might be the j*-th 
cluster mean and variance with f equal to 9»d bits if the variance 
per band is defined to 3 bit resolution. The spectral rate R is 
defined as the bits per picture element per band (bpppb) needed to 
define the spectral characteristic of or 


spec n*d 


(3.1) 


A graph of R vs. m for various values of n is shown in Fig. 3.2 with 
spec 

f = 24 and d = 4, which corresponds to using only cluster means as 
features. In some cases this description of may be all that 

is needed, and R then represents the total rate in bpppb. 

5 ^ 



2 4 8 16 


m 

Fig. 3.2 Graph of Spectral Rate per Band vs. Nuoiber of Features 
for Various n and With f =* 24 and d “ 4. 
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B. Spatial Features and Spatial Data Rate 

In many applications It Is desired to further approximate 

by approximating each vector Individually In terms of the cluster 

1 * 1 * 
features. For example. If vector X belongs to cluster j*, then X 

could be approximated by which might be simply the mean of cluster 

j*. Alternately, X^ could be approximated by a combination of 

Refer again to Fig. 3.1, Assume that after clustering the local source, 

each cluster is uniquely labeled by an integer from 1 through m. Then 

map the sequence of data vectors into a sequence of integers 

hy replacing each vector with the integer label of the cluster 

to which it belongs. This sequence of integers is called the local 

feature map and an example of a feature map is shown in Fig. 3.3, 

The feature map sequence defines the approximation of each 

measurement vector and thus gives spatial definition for the sequence 

{X^}j_,, Let represent the rate in bpppb for representing each 

element of the m-ary feature map sequence with natural coding. Then 

where 

fxl = (smallest integer ^x), • (3.3) 

This representation of the feature map is wasteful if m in (3.2) is not 
a power of two (e.g., the use of 3 bits to specify one of 5 integers for 
m=5). The representation can be generally improved by using the simple 
fixed rate natural coding on the 1-th extension (1-tuples) of the 
source. The spatial rate for natural coding of the 1-th source exten- 
sion is given by 
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Ct “ k Clog/l) . 


(3.4) 


From (3.3) we obtain 


log„m^ £ riog«m^ < log«m^ + 1 , 


-1 * I ^ J-W52" 

and from (3.4) and (3.5) there results 

i log^m i < i log^m + ^ 

Therefore, from (3.6) we have 

a 1 

R ^ > T log m 
spat •" d 2 

for all and 


(3.5) 


(3.6) 


(3.7) 


lim R 
Z-*<=o 


spat d * 


(3.8) 


Now define R ^ as 
spat 


R * = “7 log.m . 
spat d 2 


(3.9) 


From (3.7) it is observed that the best we can do with simple natural 


coding is ^gpg^j.* from (3.6) and (3.8) it is observed that for any m, 
R can be approached by coding higher source extensions. Of course, 

Suo w 

even natural coding of high source extensions is impractical, but con- 
sider now the use of the 3rd source extension which is easy to implement 

3 

for the small values of m of interest. Let R ^ be the rate achieved 

spat 

when groups of three of the original m alphabet source are represented 

3 

by natural coding. Figure 3.4 shows R and R vs. m for two 

SpSLt Sp3.c 

3 

values of d and shows that is never significantly greater than 

^spat m ^ 16. Thus the fixed spatial data rate for the Uncoded CCA 

3 

is R where 

spat 
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\pat 3d 


(3.10) 


Further reductions in the spatial data rate can be obtained through 
variable-length encoding techniques which will be considered in 
. Chapter IV. 


33331111222 
33311121122 
44331111122 
44121 1I2I43 


Fig, 3.3 Figure of a 

Typical Feature 
map . 



3 

Fig. 3.4. Graph of R ^ and R 
° spat spat 


vs. m for d=l and d*4 . 


C. Uncoded CCA Examples and Total Data Rates 

The Uncoded CCA of Fig. 3.1 compresses all local sources of an image 
to a constant data rate, dependent on the user supervision Inputs of m, 
n, f, and d for the image. The output is in two parts: 1) the cluster 

definitions defining the features extracted in spectral intensity space, 
and 2) an array of scalar numbers giving spatial definition to the 
occurrence of features in spectral intensity space. The total output 
data rate, consists in general of both spectral and spatial defi- 

nitions defined in (3.1) and (3.10). Thus 
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R ^ = R + R-' ^ ^ 

tot spec spat n»d 3*d 


(3.11) 


In some situations it is only necessary to know the group characteris- 
tics (e.g., what is there, and how much of each) for each local source, 

with no need to define each individual picture element. Then the fea- 

3 

ture map is not sent and R ^ ^ 0 and R = R . Similarly, there 

might be situations in which only spatial characteristics in the feature 

map are of interest and R could be zero. There is a great degree 

spec 

of flexibility in choosing the relative division of between 


and R 


1. Example for Image Approximation Use 

Consider an example of applying the Uncoded CCA to obtain compres- 
sion in image approximation. Assume that 8 clusters are obtained for 
each local source of 16 x 16 picture elements of 4 bands each. Also 


assume that for each cluster the user is sent 24 bits for the cluster 


mean. This corresponds to m=8, f=24, n=256, and from (3.1), ^•gpg^. ” 
.1875 bpppb. The .1875 bpppb spectral definition defines the centroids 
of 8 groups of data vectors within each local source of 16 x 16 picture 
elements. Now assume the user is also sent the feature map sequence 
consisting of 16 x 16 three bit codewords defining a spatial definition 


of the local source in terms of the 8 cluster mean features. From 

3 

(3.10), R “ .75 bpppb. The total data rate for sending the spectral 

SpcLtl 

and spatial definition (compressed image data) is = .9375 bpppb. 

As shown in Fig. 3.5(a) the user reconstructs the approximate image 


in an extremely simple manner as follows: 1) for the given local source 
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the 8 cluster centroids are placed In a sioall look up table; 2) the fea- 
ture map codewords are used as addresses to access the look up table and 
provide the proper cluster mean for approximating each picture element 
in the local source. Thus the coa?>ressed image data is very easily 
decoded into an image by the user. Furthermore, the Uncoded CCA should 
provide excellent image compression performance due to the sophisti- 
cated multidimensional quantization of spectral space provided by 
clustering. Reconstructed image examples resulting from simulations 
of the Uncoded CCA are presented in Chapter VI. 

2. Example for Supervised Classification Use 

Now consider an example where the user desires a supervised classi- 
fication result for each picture element in the image. Again assume 
that 8 clusters are obtained for each local source of 16 x 16 picture 

elements in 4 bands, and that the same m, f, and n parameters of the 

3 

previous exan^le are used. Thus as before = .1875, R = .75, 

and R *. = .9375. In fact, the exact same coumressed data is sent by 
tot 

the compressor in both examples. However, this example differs in that 
the compressed data is to be interpreted into classified picture ele- 
ments rather than an image approximation. As shown in Fig. 3.5(b) the 
user obtains the classified picture elements as follows: 1) for a given 

local source, the 8 cluster means are placed in a small look-up table; 

2) each of the 8 cluster means are classified according to the users' 
desired classification method, and the resulting user class label is 
also stored in a small look-up table; 3) the feature map codewords are 
used as addresses to access the look-up table and provide for each pic- 
ture element the user class label of the cluster to which the picture 


element belongs 
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Thus the compressed data is also very easily decoded for classifica- 
tion use. In fact for each 16 x 16 local source only 8 representative 

vectors needed to be classified -instead of 256," thereby reducing by a 

< 

factor of 32 the user classification computation. Observe also that the 
user could elect to use feature maps only for those local sources in 
which a class of interest occurs. The Uncoded CCA should perform well 
in terms of rate vs. classification accuracy since the clustering tends 
to preserve separability of the data. Basically, for whatever number 
of user classes are desired in the entire image, if m and n are such 
that more than m of these user classes seldom occur within a single n 
picture element subset, then classification performance from the com- 
pressed data is expected to be good. Examples of classification accu- 
racy vs. total data rate resulting from simulation studies are pre- 
sented in Chapter VI. 

3. Exan?)le for Feature Extraction Use 

Next consider an example where it is not necessary to define each 

individual picture element and where K can be zero. For example, 

sp^n 

in agricultural classification it may be adequate to know what crops 
are present and in what percentages in each local source. Again assxnne 
8 clusters are used per 16 x 16 picture elements of 4 bands each. As 
in the previous examples, for each cluster there is sent to the user 
24 bits for the cluster mean, but in addition 8 bits are sent to iden- 
tify the number of vectors in the cluster. This corresponds to m=8, 
f=32, n=»256, and Rgpgj, “ *25 bpppb. The per picture element defi- 
nition is not needed, so “ 0* Thus for a total rate of .25 bpppb 
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the user receives the identification of centroids and the number of 
associated data vectors for 8 groups of vectors in spectral space for 
each local source of 16 x 16 picture elements in the image. Fig- 
ure 3.5(c) shows how the user can then convert this con^ressed data 
directly to crop information by classifying the received cluster means 
and calculating percentages from the received data identifying the 
number of data vectors assigned to each cluster. This example 
demonstrates the potential for the CCA to obtain very high compression 
ratios and simple decoding for interpretation, through the extraction 
of group features. 

4. Example for a Combination of Uses 

The previous three examples demonstrated the relative simplicity in 
the Uncoded CCA for user distribution, access, and interpretation for 
three different applications. Now consider an example in which all 
three of the above data uses are simultaneously desired from the same 
image data, but by three different users. As in the immediately pre- 
ceding example an R of .25 bpppb is \ised to define 8 cluster means 

Sp6C 

and the associated number of vectors in each for every 16 x 16 local 

3 

source. Also an of *75 bpppb is used to again define each pic- 

ture element in terms of the 8 cluster means. Thus the total compressed 
data rate is R,..,. ® 1*0 bpppb. The operations of the Uncoded CCA for 
this example are depicted in Fig. 3.6. Efficient distribution of the 
compressed data is accomplished by 1) sending only the spectral data at 
.25 bpppb to the user desiring crop percentage information, and 2) 
sending the total data at 1.0 bpppb to the other two users. Each user 
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then accesses and interprets his data in the same direct fashion as 
in the corresponding previous examples. 

Figure 3.5 shows the. s.tructure -for -decoding and interpreting the 
compressed data for any of the above example applications. The basic 
look-up table operations cotild of course be done at very high data rates 
by a minicomputer. These examples demonstrated the following impor- 
tant attributes of the CCA approach; I) the same data coiq)ressor is 
used for image approximation and classification applications, 2) group 
features and/or per picture element definition can be selectively 
accessed for distribution and interpretation directly from the com- 
pressed data without first reconstructing an approximation of the origi- 
nal image, 3) the compressed data decoding operations required for 
classification and image approximation are very simple and can be done 
at high data rates with readily available equipment, 4) the computation 
required for user classification is substantially reduced due to the CCA 
compression (e.g., 1/32 as many classifications). Actual simulation 
results of image approximation and classification performance are 
presented in Chapter VI. 
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CHAPTER IV 

CODED CLUSTER C014PRESSI0N ALGORITIM 

A. Feature Map Source Ilodels and Perfonaance Bounds 

A typical feature map sequence for described by four 

cluster features is shown in Fig. 3.3. In image data there is signifi- 
cant spectral correlation between spatially close picture elements, 
and correspondingly between adjacent elements in the feature map. 
Entropy encoding techniques can use this spatxal correlation to rep- 
resent feature map sequences with a lower average spatial data rate 
than given in (3 •9).- Furthermore, this reduction in spatial data 

rate is obtained without any degradation to the feature map sequence. 

The performance of an entropy encoder is usually compared to the 
best performance possible for encoding a data source which statisti- 
cally models the actual sequences to be encoded. Basically, a source 
model is defined such that the observed feature map sequences are 
typical sample sequences from the source model. Then tne entropy of 
the source model is used to bound the lowest possible average data rate 
for encoding sequences from the source model [9], [293- An entropy 
encoder attempts to achieve performance close to the entropy of the 
source model for the feature map sequences. Two key elements to 
entropy encoding are 1) the determination of a model for the source 
which has low entropy, and 2) the determination of a practical encoder 
which can encode sequences from the source model at an average rate 
near the entropy. The source model will have lower entropy if the 
model incorporates more memory and nonstationary statistics. However, 
the encoder also tends to become impractical when it attempts to make 
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use of source memory, or adapt to nonstationary statistics. In this 
section some source models and associated entropies are defined for the 
feature map sequences. Then a practical entropy encp.ding technique i-s 
defined, and simulation results used to demonstrate the actual coding 
performance relative to the entropy bounds for several source models. 

1. Bound for Nonadaptive, Zero-Memory Encoder of Feature Map 
Differences 

The first source model is similar to the very common and simple 
approach of representing the image data in temas of differences between 
adjacent picture elements. Of course, the feature map can be repre- 
sented in terms of differences with no loss in information. The 
differences between spatially adjacent elements of a feature map are 
typically distributed as shown in Fig. 4.1. For image data, there is 
little correlation between the differences of adjacent elements, tnus 
these differences are often modeled as samples from a stationary, 
zero-memory source. We now similarly define the analogous source model 



Fig. 4.1. Typical Distribution of Differences 
Within a Feature Map . 
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and entropy for the differences between adjacent elements of a feature 
map* 

A difference in a feature inap is an integer 6, where S - -(m-1) 
through (m-1) . Let represent the relative frequency of occurrence 
of the integer 6 over all feature maps of the image. Define S as a 
source which produces integer outputs 6 with probability P^. Then S is 
a stationary, zero-memory source model for producing sample sequences 
of feature map differences. The entropy in bits for the source S is 
defined by 


(m-1) 

H(S) = S " ^6 
6=-(m-l) 

H(S) represents a performance bound (lowest possible rate) for encoding 
differences from the feature maps with any nonadaptive, zero-memory 
encoder* 

2* Bound for Adaptive, Zero-Memory Encoder of Feature Map 
Differences 

Image data and the corresponding feature map data are typically 
nonstationary. For exan?>le, the set of frequency of occurrences 

obtained from differences in the A-th feature map of an 
image may be quite different than the frequency of occurrences 

of differences obtained from averaging over all feature 
maps in the image* A source model for the feature maps will have less 
entropy if it also models the nonstationarity in the source* Simi- 
larly, for nonstationary sources a better performance can be obtained 
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from an entropy encoder which adapts to the statistical characteristics 

on a local basis. Therefore, we next define a nonstationary source 

model for the feature ^p sequences. The entropy of this nonstationary 

model serves as a useful bound on the performance of an adaptive 

entropy encoder. 

0 

Define S as a stationary zero-memory source model for the sample 
sequence of differences from the A-th feature map in the image, where 
A =» 1,2,««»,N^ and is the number of feature maps in the image. For 

a 

each il, S is defined to ouptut an integer 5 with probability where 
£ 

Pg is the frequency of occurrence of the difference between adjacent 
feature map elements being 6 within the 5.-th feature map, and 6 is an 
integer from -(m-1) through (m-1) . Then let be the nonstation- 
ary zero-memory source which is constructed from the many different 
stationary sources. For each local source in the image, the output of 
is a sample sequence from a different stationary source. The entropy 
in bits for each S is given by 

(m-i) 

6=-^(m-l) 

Averaging the entropies in (4.2) over all local sources in the image 
results in an average local entropy, or the entropy for S,, , 

Lt 

“'V . C4.3) 


46 



77-43 


H(S^) is a performance bound on the average rate obtained by a 
zero-memory encoder which adaptively encodes each feature map sequence 
of differences. Actually, represents an unachievable performance 

boimd for practical encoding, since we ignored the overhead rate 
required to inform the decoder of the new statistics relative to each 
local sequence. It can be shown that HCS^) < H(S) [9, pp. 27-28]. 

3. Bound for Nonadaptive and Adaptive Zero-Memory Encoder of 
Feature Hap Distance Rank Symbols 

Recall that the feature map resulted from replacing each picture 

element with the integer label of the cluster to which it belonged. In 

the above modeling of the feature map differences and the corresponding 
£ 

entropy H(S ) in (4.2), it was implicitly assumed that the integer 
labels assigned to the feature map clusters resulted from the estab- 
lishment of labeled initial mode centers in the clustering algorithm 
defined by (2.4) . However, the clusters corresponding to any feature 
map can certa inl y be assigned any set of distinct labels, and the 
entropy of differences in the feature map is not necessarily minimized 
by using the cltister labels from the initial mode center definition. 
Intuitively, the entropy of differences is reduced when the set of 
frequency of occurrences for differences become less uniform in value, 
for example, the small magnitude differences occur more frequently and 
the large magnitude differences less frequently. In monochromic 
image data the differences consistently tend to lower magnitude, based 
on the underlying observation that the next image element tends more 
probably to be similar to the preceding. Similarly, for mtiltispectral 
images, an image element tends more probably to be similar (in a 
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multidimensional sense) to the preceding. However, for multispectral 
data, it is often impossible to define a set of cluster labels such 
that the magnitude of the difference between each pair of cluster 
labels is monotonically related to the multidimensional intercluster 
distance , (e.g. , Euclidean distance between cluster means) between each 
pair of clusters. For example, consider the four clusters in two 
bands depicted in Fig, 4,2. The difference symbol resulting for each 
pair of ad 3 acent elements of the feature map is shown in Table I for 
the arbxtrarily chosen cluster labels. Table II is constructed by 
assigning an ordering to the multidimensional intercluster distances 
between each, pair of clusters. For cluster 2, for example, the 
clusters when ordered as closest first are 2, 3, 4, 1, and the transi- 
tions from cluster 2 to these clusters are labeled in the table 
respectively as 1, 2, 3, 4, Observe that the magnitude of the differ- 
ences in Table I are not consistently monotonically related to the 
distance ordering in Table II. Furthermore, no assignment of cluster 
labels can achieve such a relationship. Observe also that the differ- 
ence representation of the source required a source alphabet of seven 



Fig. 4.2. Four Integer Labeled Clusters. 
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TABLE I 

DIFFERENCE OUTPUTS FOR PAIRS OF SOURCE SYMBOLS 


i-th Source Symbol 


(i+l)-th Source Symbol 



1 

2 

3 

4 

1 

0 

1 

2 

3 

2 

-1 

0 

1 

2 

3 

-2 

-1 

0 

1 

4 

-3 

-2 

-1 

0 


TABLE II 

DISTANCE RANK OUTPUTS FOR PAIRS OF SOURCE SYMBOLS 


i-th Source Symbol 


(1+1) -th Source 

Symbol 


1 

2 

3 

4 

1 

1 

2 

3 

4 

2 

4 

1 

2 

3 

3 

4 

2 

J 

3 

4 

4 

2 

3 

1 

symbols as -opposed to four 

symbols 

used in Table II. 

Therefore, 

the 


representation of the source in termtf of differences becomes less 


appropriate for multispectral imaging. We next define a new represen- 
tation of the-feature map which incorporates the multidimensional nature 
of the source. This representation will be generally lower in entropy 
and independent of the cluster labeling. The difference representation 
of the source is still worthy of note because of its greater simplicity, 

and also because its entropy may often be not significantly higher. 

a 

Define SS as a new source model for the Jl-th feature map. Assume 
the A-th feature map has m clusters with any arbitrary but distinct 

I 

set of cluster labels. Then the output S3rmbol from SS for each fea- 
ture map element is an integer r,r = !,•••, m with r determined as 
follows: if the cluster corresponding to the present feature map ele- 

ment is the i— th closest of all clusters to the cluster corresponding 
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to the previous feature map element, then r =* i. Closeness is measured 
here in terms of the simple intercluster distance, the Euclidean dis- 
tance between cluste^r centers. - Any predetermined ordering can be used 
for equally close clusters. Refer again to the example of four clus- 
ters in Fig. 4.2. Table II defines the mapping between feature map 
transitions and output symbols for SS . For example, when the previous 
feature map element is a 2, the output symbol from SS is either 4, 1, 

2, or 3 dependent on whether the next feature map element is a 1, 2, 3, 

■ Z 

or 4 respectively. Note that a symbol of SS is consistently smaller 

in value when the adjacent feature map elements are more smilar. This 

generally results in a more nonuniform distribution of symbol values 

Z 

and a lower entropy. The distance rank representation SS is more 

Z Z 

generally applicable than the difference representation S in that SS 

possesses the above desirable traits for any cluster labeling and any 

dimensionality of the cluster space. The conversion between feature 

£ 

map elements and SS symbols is specified by an m x m distance rank 

matrix. Therefore, this source representation requires the additional 

computation of the distance rank matrix at the source and at the 

destination for each feature map. However, no overhead rate is required 

to inform the destination of the distance rank matrix, since this matrix 

can already be calculated at the destination from the available spectral 

data which defined the clusters of the feature map. 

Let SS be a source model for the sequence of distance rank sym- 

Z 

bols from the i.-th feature map. Every symbol in a sequence from SS is 

Z Z 

defined to have probability P of being r, where P is the frequency of 

r IT 


50 



77-43 


occurrence of the value r in the distance rank representation of the 
Z-th feature map. Then the entropy for SS is 

H(SS^) = 22 - pj I 0 S 2 '^x » 
r=l 


and the average local entropy for the nonstationary model of feature 
map distance rank symbols SS^ is 


N, 


H(SSj^) 


JL V' 

■ 


H(SS^) 




( 4 . 5 ) 


Now let be the frequency of occurrence for the distance rank sym- 
bol r averaged over all feature maps in the Image. The stationary 
model of feature map distance rank symbols is defined as SS and it 
produces sequences for which the probability of each S 3 nnbol being r is 
P^, The entropy of SS is 

ta 

H(SS) *’22" ^r * 

3r*l 

H(SS) and H(SSj^) represent performance bounds for a nonadaptive and 
adaptive encoder respectively which encodes sequences of distance rank 
symbols from the stationary and nonstationary models of the feature 
maps. 
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4. Bound for Adaptive, First-Order Memory Encoder of Feature 
Map Symbols 

Before comparing entropies of the above practical source models^, 
we consider a nonstationary first order Markov source model SSS^^ for 
the feature maps. This source model is of less practical value than 
the preceding source models for the following reasons: use of the 

additional memory in the' model would complicate the encoder, and the 
overhead rate and complexity associated with informing the decoder of 
the new statistics for each feature map becomes substantial. There- 
fore, SSSj^ is used here only to provide a performance bound which spec- 
ifies the limits to the performance improvements which could be 
obtained from the next level of modeling complexity. 

< 

In a first order Markov source the probability of emitting a given 
source symbol depends only on the previous symbol. If there are 
m source symbols then there are m states- for the source. A state dia- 
gram is used to depict a first order Markov source for three symbols in 
Fig. 4.3. In general, a first order Markov source is completely speci- 
fied by giving the source alphabet and the conditional proba- 

bilities pCa^/a^) of getting symbol a^ after a^ for j = 1, 2, . . . , m 



Fig. 4.3., State Diagram for a First Order Markov Source- 
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and k*»l, 2, . . , ,ia. The state probabilities p(a^) for j ® 1> 2, 

. . . , m can be calculated from the conditional source probabili- 
ities [30, pp. 338-376]. We can model the feature maps as sequences 
from a nonstationary first order Markov source as follows. Define the 
source alphabet as the integers which are used to label the clusters, 

a^ = j for j=l, 2, . . . ,m. (4.7) 

Also define the conditional probabilities of the i-th feature map by 

P^(a^/a^ « p^(j/k) (4.8) 

where p (j/k) is the frequency of occurrence of an integer j following 

2 

an integer k in the 2-th feature map. Let SSS denote the stationary 

first order Markov source model for the 2-th feature nap sequence 

- £ 

as defined by (4.7) and (4.8). Then the entropy of SSS is defined as 

m 

H(SSsS P^(k)H(SSsVk) , C4.91 

k=l 

2 2 

where H(SSS /k) is the conditional entropy of SSS while in the k-th 

state. 


m 

H(SSsV) “2 “ P^(3/k) log p^a/k) C4.10) 

d-1 
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for k = 1, 2, . . . , m. Averaging the entropies in (4.9) over all 
feature maps in the image gives the average entropy for the nonstation- 
ary source model SSS^, 

H(SSSj^) = ^ ^ H(SSS^) . (4.11) 

H(SSS ) represents a performance bound for an entropy encoder which 
Xj 

adapts to each sample sequence from the nonstationary first order 
Markov source model of feature maps. Actually, H(SSSj^) represents an 
unachievable performance bound, since H(SSSj^) does not include the rate 
required to inform the decoder of the relevant source statistics for 
each local source, which is required for decoding. However, this bound 
is useful since it does place a limit on possible performance improve- 
ment from using the full first order memory of the source. 

I 

B. Simulation Results Comparing the Performance Bounds 

In Chapter III the Uncoded Cluster Compression Algorithm was shown 

3 

to have a spatial data rate. (bpppb) of R ^ as defined in (3.10). It 

spat 

3 

was also shown that R ^ was approximately the same as R ^ = 

spat spat 

(1/d) log 2 ®> where m is the number of clusters in the d dimensional 
measurement space. The lower spatial data rate due to entropy coding is 
bounded by the entropy per band, 

H^(^)=S^, (4.12) 
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where SP is one of the source models, S, S , SS , SS, SSS_ , and where 

L Xi Xj 

H(^) is defined by (4.X), (4.3), (4.5), (4.6) and (4.11) respectively. 
The entropy per band depends directly on m, the nuniber of clusters per 
local source, and also upon the test image used which determines the 
frequency of occurrences for sjTobols of the source model. Entropy per 
band also depends Indirectly upon the size of each feature map source 
n, which impacts the nonstationarity of the source, and the set of 
frequency of occurrences. The performance bounds for each of the source 
models is plotted in Fig. 4.4 for various values of m and n. The bounds 
were obtained in each case from using the appropriate feature map source 
models for all the spatially disjoint square subsets in the test image 
with n picture elements. For example, HCSj^) for m = 8 and n = 256 is 
determined as follows: Each 16 16 subset of picture elements of the 



Fig. 4.4 Performance Bounds from the Various Source Model 
Entropies as a Function of m and n. 
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test image is clustered to 8 clusters, then the frequency of occurrences 
for differences within each 16 >= 16 feature map is calculated, and the 
corresponding H(S^) and H(Sj^) result from (’4.2) and (4.3). The test 
image is shown in Fig. 6.8(a) and is a 256 x 256 picture element subset 
of a four band multispectral image of the Verde Valley, Arizona, taken 
by the Landsat I satellite. H(^) was also determined for a wide range 
of other test images and only slight differences from the H(,S^) for the 
test image in Fig. 6.8(a) were observed. 

For any source 9 ^ let e^(^) be the ratio of entropy per band to 
the uncoded rate per band. 








(4. 13) 


The following information concerning feature map entropy coding can be 
observed from the typical performance bounds shown in Fig. 4.4. The 
ratio ranges mainly between 0.6 to 0.9 with a typical value of . 

0.75. For a given n, Sjj(«^) reduces as n increases. Similarly, for a 
given n, reduces as m decreases. In each case the decrease in 

8j^(^) is due to the decrease in the ratio of clusters to picture 
elements. A decrease in this ratio corresponds to a coarser approxi- 
mation of the image, which in turn causes a smoother contouring in the 
feature map and lower entropy. Incorporating the nonstationary charac- 
teristics of the source into the source model results in reducing 
by about 5 to 15 percent. The difference between the simplest source 
model S and the most complicated practical source model SS, is typically 

Li 

15 percent. Furthermore, the use of the clearly unachievable bound 


56 



77-43 


H,(SSS ) results in less than 10 percent improvement in e, (^) relative 
d L a 

to the use of In summary, the performance bounds for the 

practical source models indicate that entropy encoding could possibly 
reduce the spatial data by factors approaching 0.8 to 0.65 dependent 
on source modeling complexity. However, a reduction in spatial data 
rate by a factor less than 0.55 is not possible even with a considerably 
more complicated and less practical source model. Therefore, the 
practical encoder defined next will be applied only to the source 
models S_ and SS . 

i-> Xi 

C. Definition of Practical Entropy Encoder 

The chosen practical entropy encoder is an adaptive variable 
length encoder described in [31] and briefly defined below. An 
adaptive encoder is chosen to enable encoding of widely different types 
of image data. Once one can adapt to statistics of different images, 
it is a simple extension to be adaptive at a more local scale within 
the image. The encoder also includes the ability to represent 3- tuples 
of the source, thereby enabling average encoded data rates below one 
bit per element of the feature map. Of course, an adaptive Huffman 
encoder for the third extension of the source could be used, but such 
an encoder would be much less practical to implement. The adaptive 
variable length encoder defined below is shown to be easy to implement 
and performs near the previously discussed entropy bounds over a large 
range of source model entropies. Basically, the procedure of the 
adaptive encoder is to encode each code block with four different 
methods and use the method which results In fewest bits. Each encoding 
method is capable of good performance on typical sample sequences from 
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a source with an. entropy within a specific and limited entropy range. 

By using the best of all four methods for each code block, good encod- 
ing performance is obtained for sources with entropies over the„ full 
range of interest. 

The functions of the practical entropy encoder are diagramed in 
Fig. 4.5. The first element of each feature map sequence 
left uncoded and used as a decoding start reference. The rest of the 
feature map elements are segmented into code blocks of length J, where 
J is typically fixed between 16 to 32, except for the last code block 
which may be smaller. Each of these code blocks is then encoded with 
one of four methods and the method used is identified for each code 
block by using two overhead bits. A J between 16 to 32 represents a 
good performance compromise beti^een adaptability and overhead costs. 

The first coding method is simply the natural coding of 3-tuples. This 
method serves as a backup coding procedure which limits the coded data 
rate to a rate_only slightly higher than the uncoded data rate, 
because of the overhead bits. Furthermore, this encoding method is 
efficient for code blocks for which the symbols are nearly uniformly 
distributed. The results of this coding for the code block is stored 
in Buffer 1, 

The other three coding methods require that the code block 
sequence be represented in terms of either differences between adjacent 
elements, or as a sequence of distance rank symbols as previously 

a a 

discussed in the source representations for the S and SS source 
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FEATUSE MAP SEQUENCE 


lyU i=j 



yi CS 


DECODING START REFERENCE CODE SEQUENCE 

(ONCE PER FEATURE MAP) TWO BIT ID TO DEFINE 

MINIMUM LENGTH SEQUENCE 
OF {NQ CFS, FS, CFS) 

(ONCE PER CODE BLOCK) 


Fig. 4.5. 


Block Diagram of a Practical Entropy Encoder . 
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models respectively. If the source is modeled by S , then a Funda- 
mental Sequence FS is generated for the code block according to Table 

£ 

III. If the source is modeled by SS , then the FS is generated 
according to Table IV. The FS for the code block is collected into 
Buffer 3 and constitutes the method 3 encoding for the code block. As 
can be seen from Tables III and IV, this encoding is efficient when the 
first few symbols of Tables III and IV occur with high frequency in the 
code block. However, this method is obviously not efficient for code 
block sequences which result in many Os within FS. Neither is method 3 
efficient coding when FS is nearly all Is, since the average rate per 
symbol of the FS cannot be below X bit per symbol. Method 3 provides 
efficient coding for source models with entropies in the range of 1.5 to 
3.0 bits per symbol [3l]. 

At the same time that FS is stored into Buffer 3, it is also seg- 
mented into 3-tuples and terminated with extra zeros as needed. This 
sequence of _3M:uples is then used to generate a Coded Fundamental 
Sequence CFS according to Table V. The GFS is collected into Buffer 2 
and constitutes the method 2 encoding for the code block. From Table V 

TABLE III 

CONSTRUCTION OF FS FROM DIFFERENCES 



Difference 

FS Contribution 


0 

1 


I 

01 


-1 

001 


2 

0001 


-2 ^ 

m 

00001 

• 


• 

« 

* 

• 
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TABLE IV 


CONSTRUCTION OF FS 

FROM DISTANCE RANK SYMBOLS 


• Distance Rank Symbol 

FS Contribution 


X 

1 

2 

01 

3 

001 

4 

0001 

5 

m 

00001 

• 

• 

• 

• 

• 


it is clear that the method 2 encoding is more applicable to an FS 
with considerable runs of Os. Method 2 provides efficient coding 
for source models with entropies in the range of 3.0 bits per symbol up 
to an entropy equal to the rate of natural coding of 3-tuples in 
method 1. 

Method 4 is implemented in parallel by again segmenting FS into 
3-tuples, but terminating with Is. Then the sequence of 3- tuples are 
complemented and used to generate a Coded Fundamental Sequence Bar CFS 
by using the same Table V as used for generating CFS. The CFS sequence 
resulting from coding method 4 is stored into Buffer 4. Method 4 
provides efficient coding for long sequences of Is in the FS. This 

TABLE V 

CONSTRUCTION OF CFS FROM 3-TUPLES 




3-Tuple Input 

CFS Output 


000 

0 

001 

100 

010 

101 

100 

110 

101 

11100 

Oil 

11101 

110 

11110 

111 

mil 
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corresponds to efficient encoding for source models with entropies 

below 1.5 bits per sjnnbol and down to entropies approaching 0.5 bit per 

symbol. The details concerning- which range of entropies each coding 

method is most efficient for are discussed in [3l]. 

For the b-th code block the coder selects the coded sequence which 

has minimum length and outputs the following: 1) a two bit sequence ID 

which identifies for the decoder the method used, and 2) the corre- 

b b b ^ 

spending minimum length coded sequence, NC , CFS , FS , or CFS . This 
procedure is repeated for each code block in the feature map. In this 
way the entropy encoder is noted to adapt to the data source for 'every 
J elements. The entropy encoder output for each feature map consists 
of the decoder start reference, and a multiple number of adaptively 
coded sequences. This encoded data stream Includes all the necessary 
• overhead to enable similarly simple decoding back to the original fea- 
ture map elements. 

Let th e co ded output sequence for code block b be represented by 
CS^. Also let lCh] represent the length in bits of any binary 
sequence q. Then for the b-th code block the coded output has length 

L[CS^] = min |^l[NC^],L[CFS^],L[FS^] ,L[CFS^] j + 2. (4.14) 

The total length of all the coded output sequences for B code blocks in 
the )l-*th feature map is given by 

B 

TL^ = ^ L[CS^] . (4.15) 

b=l 

The decoder start reference for each feature map requires flog 2 
bits, where m is the number of source symbols in the feature map. 
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The averaged coded spatial data rate in bpppb for the £-th feature map 


with source representation 3^ is 


R , (.^ 

spat 


fe ll H. 11=!. 

d»n d*n 


(4.X6) 


r»S‘ a a 

where 9^ is either S , or SS . The entropy coded spatial data rate 
averaged over the feature maps of the image is 


Nt 


R , 
spat L 




A=l 


spat 


(4.17) 


In summary, R ^.(S-) is the average encoded spatial data rate 
sps.c X4 

(bpppb) resulting from entropy encoding the difference between adjacent 

Q 

feature map elements from the entire image. Similarly, R (SS_ ) 

spat L 

represents the average spatial data rate resulting from entropy 
encoding the distance rank symbols representing the feature maps. 


D. Computer Simulation of the Practical Entropy Encoder 

A computer .simulation of the entropy encoder was developed to 

demonstrate its performance on the test image in Fig. 6.8(a). The 

performance of the simulated entropy encoder is shown in Fig. 4.6 in 
c c 

terms of R ^(S_) and R ^(SS, ) as a function of m and n. The uncoded 
spat L spat L 

spatial data rate and the entropy per band for source models S^^ and 

SS, are also plotted for comparison. Recall that Hj(S_) and H,(SS, ) in 

a L d L 

Fig. 4.6 represent an unachievable bound to the performance obtainable 
by an entropy encoder which adapts to each feature map while encoding 
the nonstationary source models and SS, for the image in Fig. 6.8(a). 
The results in Fig. 4.6 are typical of results similarly obtained from 
many other multispectral images. It can be observed from Fig. 4.6 that 


63 



77-43 



Fig. 4,6. Performance of the Practical Entropy Encoder. 

the entropy encoder performs reasonably close to the corresponding 
unachievable performance bounds. Therefore, further refinement in the 
entropy encoder can result in only small improvements in performance. 
Furthermore, it was also shown earlier in Fig. 4.4 that even nonstation- 
ary Ist-order tiarkov source modeling of the data would not result in 
substantial improvement in the performance bound for entropy encoding. 
These observations indicate that additional substantial encoding 
performance gains are very imlikely to be obtained from any other 
practical encoder which is similarly constrained to permit exact recon- 
struction of the feature map. The above practical entropy encoder is 
also relatively easy to implement, since it requires simple logic and 
little memory. Note, for example, that Buffer 2 through Buffer 4 in 
Fig, 4.5 need be no larger than Buffer 1. Therefore, Buffer 1 through 
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Buffer 4 are all at most 128 bits. A serial implementation of the four 
coding methods can be used in lower data rate applications. 

In svinnnairy, the entropy encoder should provide a typical average 
data rate reduction of about 20 percent. This data reduction is 
significant, especially since no degradation of the feature map occurs. 
The entropy encoder itself was also shown to be relatively easy to 
implement at high data rates. However, the entropy encoder is a vari- 
able length encoder, and variable data rates often seriously impact the 
complexity of the rest of the data system. For example, in applica- 
tions on-board a satellite it is usually necessary to interface with a 
constant data rate transmitter. Thus the use of a variable data rate 
encoder introduces a need for considerable data buffering and data rate 
control measures between the encoder and the transmitter. The imple- 
mentation complexity of these additional data system elements must be 
considered before choosing to use any variable-length encoder. In the 
next sectionr~we consider ocher feature map coding methods which are 
not constrained to exact reconstruction of the feature map. 

E. Other Feature Map Encoding 

The above entropy encoding did not Introduce any degradation in 
representing the feature map. However, in some applications only cer- 
tain spatial characteristics of the feature map need to be preserved. 
For example, only closed boundaries in the feature map may be desired 
for field interpretation. The use of spatial texture in this applica- 
tion would allow the elimination of most irregular single elements 
within fields. When only specific spatial characteristics are of 
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interest, the feature map can in effect be represented by a sequence 
typical of a source with lower entropy. Then the following entropy , 
encoding may obtain significantly lower data rates. However, the 
texture to be preserved and the associated fidelity criterion for its 
preservation are usually very difficult to define. Correspondingly, 
encoding algorithms which incorporate spatial texture are typically 
less amenable to high data rate implementations. Perhaps the task of 
extracting spatial texture from d-dimensional image data (e.g., boundary 
finding algorithms) may be aided by first having local regions of the 
image converted to scalar numbers which represent significantly differ- 
ent spectral features. The incorporation of spatial texture into the 
feature map encoding is not pursued at this point, but appears to be a 
topic worthy of further investigation. 

An example of a very simple method for reducing the spatial data * 
rate is the deletion of subsets of the feature map with reconstruction 
by linear interpolation. This approach is a crude but simple method 
which might be used if spatial resolution is higher than necessary for 
observing texture in a particular application. The spatial data rate in 
this case is simply reduced by the feature map subsampling factor. 

F. Definition of the Coded CCA 

The block diagram for the Coded CCA is shown in Fig. 4.7. The 
Coded CCA is identical to the Uncoded CCA shown in Fig. 3.1 except for 
the addition of the feature map encoder. The feature map encoder may 
be a spatial feature extractor, an entropy encoder, or both. In this 
work, primary consideration is given to entropy encoding, and the Coded 
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CCA should be assumed to use this encoding unless otherwise specifi- 

( 2 , 

cally stated. Let represent the average rate in bpppb for the 

coded feature map sequence. Define a compress.ion ratio CR for the 
feature map encoder by 


CR 


R 


spat 


spat 


1 

3” 

log, m 

3«d 



spat 


(4.18) 


In general, the rate for the coded feature map sequence is then given by 


log2 m 


spat 


d*CR 


(4.19) 


and the total coded data rate for the Coded CCA is represented by R , 

tot 


where 



R + R^^ 
spec spat 


n*d d*CR 


(4.20) 


Equation (4.20) also applies to the Uncoded CCA when CR = 1. 

As noted for the Uncoded CCA, the total data rate R^ ^ for the 

’ tot 

Coded CCA also consists of a mixture of spectral and spatial defini- 
tions. However, for the Coded CCA the division of R^ ^ between spectral 

tot 

and spatial definition is dependent on CR in addition to m and n. Let 
o 

^ in (4.20) be a constant R , and express n as a function of m to 
tot tot 

get 


m»f 



(4.21) 
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where 



c* 

A plot of n vs. m in (4.21) for a constant R ^ = 1 is given in 

tot 

Fig. 4.8 with f =■ 24, d = 4 and CR = 1 and 2. This plot demonstrates 
that a given 8.^^^ may be approximately obtained by many combinations of 
m, n and CR. Chapter VI will compare the performance of various 
(m, n, CR) combinations which have the same 



ffl 


Fig, 4,8. Plot of n vs. m for “ 1 with f = 24, d = 4, 
and CR “ 1 and 2, ° 


69 



77-43 


CHAPTER V 

OTHER CLUSTERING FORMS OF THE CCA 

A. The Adap.tive- CCA 

In both the Uncoded CCA and the Coded CCA» it was assumed that 
the ntnnber of clusters m per local source was a constant throughout the 
entire image. However, since most image data is nonstationairy, it is 
desirable to have the ntmiber of clusters used per local source be 
variable and adaptively determined to meet the quality requirements for 
the data of each specific local source. For example, one local source 
may consist of data for which only two clusters are necessary to meet 
the quality requirements, while another local source may require ten 
clusters. Rather than use ten clusters for each local source, it is 
preferable to use only as many clusters as necessary for each local 
source. The Adaptive CCA is defined to incorporate the use of adaptive 
clustering. The advantage of the Adaptive CCA is that a lower average 
value of m can -be used and thereby the spectral and spatial data rates 
are both reduced. 

The form of the Adaptive CCA is the same as for the Coded CCA 
shown in Fig. 4.7 except the clustering is adaptive. The Adaptive 
Clustering Algorithm (ACA) described in Chapter II is used to demon- 
strate the concept of the Adaptive CCA. The ACA is adaptive by delet- 
ing, splitting, or combining clusters according to corresponding user 
supervision theresholds. The user supervision also specifies the lower 
and upper limits, on the number of clusters used per local source. 

These thresholds and limits ideally allow the user to define the 
quality requirements in terms of 1) minimum number of elements 
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(relative spatial density) required for distinct representation, 

2) maximum intracluster variance, 3) minimum intercluster distance, and 
4) the minimum and maximum niunber of clusters per local source 
(minimum, and maximum number of meaningfully different spectral 
reflectances present per local source) . These user supervision inputs 
are constants and need to be specified only once for the entire image. 

For each local source, the adaptive clustering results in a vari- 
able number of clusters. The number of clusters for each local source 
is dependent on the user supervision inputs and the data of the local 
sources. The spectral data definition still defines the cluster fea- 
tures present for the local source, only now the number of features is 
variable. Thus, the spectral data definition of each local source must 
now also include a short lead feature which indicates the mmber of 
clusters present. The feature map encoder and the coded spatial data 
rate are unchanged from the nonadaptive CCA, except that the number of 
symbols per feature map element now can change from one feature map to 
the next. The ground data selection, reconstruction, or interpretation 
proceeds as before except that the lead spectral feature is used to 
inform the decoder of the number of spectral features being transmitted 
for each local source. 

The extra complexity introduced by adaptive clustering comes from 
two sources, 1) the added complexity of the clustering algorithm, and 
2) the increased data system complexity resulting from the variable data 
rate. Refer to the block diagram of the ACA shown in Fig. 2.4, and the 
BCA shown in Fig. 2.3. Observe that both algorithms have the same 
Assign function which requires a very high computation rate of a simple 
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distance measure between vectors, for example the distance measure in 
(2.1) or (2.2). The ACA in addition requires more difficult computa- 
tion during the cluster modification phase between Assign i'tefations, 
for example the intercluster distance measure of (2.12) for combining 
clusters. However, this more involved computation can be performed at 
a lower rate than the simple Assign computation, since it is required 
only once per each complete Assign function. Thus the structure of the 
ACA is such that the more complicated computation is required at a 
greatly reduced frequency of repetition. 

The adaptive clustering results in a variable data rate and the 
same associated data system problems discussed in Chapter IV for the 
variable- rate feature map encoder. The range of the data rate is 
limited by the specification of the upper and lower bounds for the 
number of clusters. In a data system application requiring a fixed 
data rate, one should consider augmenting the adaptive clustering with 
a control structure to achieve a fixed average data rate over a volume 
of data which is compatible with practical data buffer limitations. 

This control structure would likely consist of an estimator for deter- 
mining the percentage rate allocation to be made between the local 
sources of a fixed rate block. Then the data rate allocations could be 
made through corresponding modifications of the thresholds and limits 
used for each local source. At this time the investigation is focixsed 

on the Adaptive CCA without consideration of rate control. 

A Si 

Let m and CR represent the number of clusters and the feature 
map encoding compression ratio respectively for the ^-th local source 
of the image. Let M , be the maximum number of clusters allowed in any 
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local source as specified by the user supervision. Assume the number 

of clusters per local source is identified by simply using |”lo 82 

bits at the beginning of the spectral definition data for every local 

source. Then the average adaptive spectral data rate R(A) and the 

spec 

average adaptive spatial data rate R^(A) for the Adaptive CCA are 

spat 

given by 


N. 


Spec d*n N d* 

^ Jl=X 


(5.1) 


and 


r'^CA) 

spat 


1 ^ 

N, 3d-CR^ 




(5.2) 


where II is the number of local sources in the image. The average 

Li 

adaptive total data rate R^(A) is given by 

tot 


r^(a) = RC4) + r*^(a; . 


(5.3) 


tot 


spec 


spat 


In the next chapter computer simulations are used to measure perfor- 
mance improvements due to using adaptive clustering.. 


B. The Cascaded CCA 

When a sequence of local sources is individually clustered, there 
is likely to be considerable similarity between many of the clusters. 
For example, assume a sequence of 16 local sources are clustered to 
8 clusters each. This results in 128 different clusters for the 
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sequence of 16 local sources. However, many of these 128 clusters 
resulted from image elements in different local sources, but with 
similar spectral reflectance properties. Thus many of these 128 clus- 
ters are -very similar. Usually similar clusters come from different 
local sources. For example, cluster 1 of local source 1, cluster 1 of 
local source 2, cluster 2 of local source 3, etc., could all be similar 
clusters because they might each represent similar image data in differ- 
ent local sources. However, similar clusters can also be from the same 
local source. Although the clusters within a local source are by 
design dissimilar from one another, the clusters might still be very 
similar on a relative basis when compared with all the other clusters 
resulting from a sequence of local sources. For example, assume that 
the above 128 clusters represent 16 distinctly dissimilar types of 
image data. However, assume local source 3 possesses only 2 of the 
16 dissimilar types of image data. Then many of the 8 clusters from 
local source 3 will be quite similar relative to the dissimilarities 
occurring between the other clusters. 

One method of taking advantage of these similar clusters is to use 
cascaded clustering within the CCA. Cascaded clustering basically col- 
lects the cluster descriptions for a sequence of local sources and then 
clusters these local clusters into a smaller set of cluster descrip- 
tions to serve as the cluster features for the sequence of local 
sources. For example, a sequence of 16 local sources are initially 
clustered to 8 clusters each, and the cascaded clustering combines the 
resulting 128 local clusters into 16 sequence clusters to represent the 
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sequence. Each local source Is still represented in terms of local 
cluster features, however, the local cluster features for each local 
source are now a subset of the 16 sequence clusters which represent the 
sequence of local sources. In addition, the cascaded clustering com- 
bines those clusters within a local source which are similar relative 
to the clusters present in the sequence. Thus the average number of 
clusters per local source might for example be reduced from 8 to 5. 
Cascaded clustering reduces the spectral data rate by eliminating the 
need to separately define the many similar clusters resulting from 
independent clustering of local sources. Cascaded clustering also 
reduces the spatial data rate by reducing the average number of clus- 
ters per local source through the combining of clusters in a local 
source which are similar compared to all the clusters present in the 
sequence of local sources. An inherent advantage in cascaded cluster- 
ing is the ability to modify the local source clustering based on knowl- 
edge of clustering results for surrounding local sources. 

1. Sequence Spectral Definition 

The form of the Cascaded CCA is shown in Fig. 5.1, and a flow 
chart of the functions is given in Fig. 5.2. The input data consists 
of a sequence of local sources, each consisting of n image elements* 
The partial supervision as before specifies the constant iwmber of 
clusters m^ to be obtained for each local source. Each local source 
is then clustered as in the previous CCA forms to provide a set of m^ 
cluster features and a feature map for each local source. However, now 
the m^ cluster definitions and the feature map for each of the local 
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Fig. 5.2. Flow Chart of Cascaded CCA Functions. 
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sources are collected in memory for further processing. The user super- 
vision for the Cascaded CCA must in addition specify the user quality 
requirements on a sequence basis, as, for example, the number and type of 
cluster features desired to represent the sequence of local sources. 

Once the mo*N local clusters have been collected, they are clustered 

to obtain a set of m sequence clusters to be used as the spectral fea- 

s 

tures for the sequence of local sources. Assume the Cascaded CCA and 

the ground decoder agree to assign the integer labels l,2,...,m^ to 

sequence clusters, according to the order of transmission and reception. 

The cascaded clustering results in a mapping between the integer labels 

for the m^ *11 local clusters and the integer labels for the m sequence 
Xf s s 

clusters, as demonstrated in Fig. 5.3. The m sequence clusters are 

s 

output to provide the cluster definition per sequence at a sequence 

spectral data* rate defined as R (seq) . 

spec 

2. Local Spectral and Spatial Definition 

Figufe“5.4(a) shows sections of typical feature maps for the first 
and second local sources in terms of the local labels. Figure 5.4(b) 
shows the same feature map sections represented in terms of the 
sequence labels. The feature map for the Cascaded CCA is in terms of 
the sequence labels as in Fig, 5.4(b), However, since the total 
number of sequence clusters is usually larger than the number of 
sequence clusters used in a given local source the feature map repre- 
sentation in Fig. 5.4(b) is inefficient, A more efficient method of 
transmitting each feature map is to send [log 2 bits to identify 

the number of sequence cltisters m^ occurring within the b-th local 

s 
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SOURCE SOURCE 
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ft) 


Fig, 5.3. Mapping Between Local Fig. 5.4. Typical Sections of 

Cluster Labels and Feature Maps, (a) In 

Sequence Cluster local labels. (b) In 

Labels. sequence labels. 

source. Then send |~ log^ bits to identify each of the sequence 

clusters present in the b-th local source. This data is referred to 

as the local spectral data with a data rate of R.(loc ) , and an example 

spec 

of this data is given in Fig. 5.5 for the local source examples shown 
in Fig. 5.3 and Fig. 5.4. Assume the Cascaded CCA and the ground 
decoder agree to assign in the order of transmission integer labels 
1,2,... to the subset of sequence clusters present in a given source. 
These integer labels are defined as the subset labels, and they are 
assigned to the subset of sequence labels in the order of their 
transmission. An example of the mapping between the subset of sequence 
labels in a local source and the subset labels is shown in Fig. 5.6. 
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Fig, 5.5. Local Spectral Data 
for Local Sources 1 
and 2. 


Fig. 5.6. Mapping of Sequence 
Labels into Subset 
Labels for Local 
Sources 1 and 2. 


Observe that the mapping in Fig. 5.6 simply takes the sequence labels 
back into the original local labels of Fig. 5.3 except for the deletions 
of local clusters which were combined in the cascaded clustering. The 
> feature map for the Cascaded CCA can then be represented in terms of 
the subset labels as shown in Fig, 5.7. -The decoding of the subset 
labels is defined by the transmitted spectral data shown in Fig. 5.5 
which specifies for the decoder the mapping in Fig. 5.6. The trans- 
mission of the uncoded feature map in terms of subset labels (Fig. 5.7) 
requires [*1082 ™g*| bits per element, which is usually considerably 
less than rioS 2 bits per element required to directly transmit 
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Fig. 5.7. Cascaded CCA Feature Map in 
Terms of Subset Labels. 
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the feature map in terms of sequence labels (Fig. 5.4(b)), Also the 

overhead bits required to enable decoding of the subset labels is very 

small when averaged over all the samples in the local source. Consider 

local source I in the example of Figs, 5,3-'5,7, where m =16, m^*4, 

s s 

m^»4, and assume the number of elements n is 256. Direct transmission 
of the sequence labels in Fig, 5.4(b) requires 4 bits/element. Alter- 
nately, transmission of the overhead in Fig. 5.5 requires an average of 
(2 + 4 X 4)/256 “ .07 bit/element, and the transmission of the subset 
labels in Fig. 5.7 requires 2 bits/element, or a total of 2.07 bits/ 
element. Representation of the Cascaded CCA feature map in terms of 
subset labels significantly reduces the spatial data rate, while pre- 
serving a simple look-up table architecture for the decoding 
implementation. 

3. Implementation Considerations 

The primary new function in the Cascaded CCA of Fig. 5.1 is the 

cascaded clustering. Cascaded clustering impacts the implementation 

complexity in terms of memory and computation requirements. Additional 

memory is required to collect the local cluster definitions and the 

local feature maps which make up a sequence. The number of cluster 

definitions to be stored per sequence is where m^^ is the number 

of clusters per local source and the number of local sources per 

sequence. For example, assume m.=8 and N =16, and that the mean 

*<0 s 

and variance of each of 4 bands are stored for each cluster. Then the 

storage required is 1024 8-bit words. Each local feature map requires 

storage for n*N local labels, where n is the nuinfaer of elements per 
s 

feature map. A reasonable maximtmi value of m^^ is 16, so each label 
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takes 4 bits. Thus a sequence of 16 local sources, each of 16 x 16 ele- 
ments requires 4096 4-bit words of storage. These numbers indicate a 
relatively small amount of memory is adequate for the Cascaded CCA. 

The additional conq)utational requirements in the Cascaded CCA 

arise mainly due to the cascaded clustering. The relabeling of any 

% 

feature map from local labels to subset labels requires a simple map- 
ping as demonstrated in Figs. 5.3 and 5.5. However, the clustering of 
clusters requires the use of intercluster distance measures which are 
generally more complicated than the distance meastires used in cluster- 
ing vectors. A discussion of many intercluster distance measures is 
contained in [13], and an intercluster distance measure used for the 
Adaptive Clustering Algorithm (ACA) is discussed in Chapter II. An 
important attribute of the Cascaded CCA structure is that the more 
difficult computations require a correspondingly lower frequency of 
repetition. The more difficvilt computation of the cascaded clustering 

must be accon?)iished only once per the clustering of N local sources, 

s 

and similarly only once per the N *n vectors present in an entire 

s 

sequence. Therefore, the impact of cascaded clustering on the average 
computation per vector is greatly reduced. 

The basic concept of this cascaded structure is to use fast spe- 
cial purpose hardware for siiqple con 5 >utation at the data vector level 
for extracting local features, and then to use slower general purpose 
microprocessors for more complicated computation at the local source 
level for extracting sequence features. Of course, this cascading could 
be expanded with additional higher levels of even slower rate and more 
sophisticated computation. In addition, one could incorporate adaptive 
clustering within the Cascaded CCA at the local level, sequence level. 
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and higher. The generalized concepts of this cascaded architecture 
appear to be useful for applications in artificial intelligence and 
other sophisticated pattern recognition problems dealing with high data 
rate sources. However, the scope of this work is limited to formulation 
of the concept and investigation of a very sin^Jle form of the Cascaded 
GCA. 


4. A Specific Cascaded GCA Form 

A sin 5 >le form of the Cascaded CCA in Fig. 5.1 is defined below. 

The user supervision consists of specifying a constant number of local 
clusters to be obtained for each local source, a fixed number of local 
sources to be used for each sequence, and a constant number of sequence 
clusters to be obtained for each sequence. This user supervision is 
specified only once per entire image. The clustering of vectors is 
accomplished using the .Basic Clustering Algorithm discussed in Chap- 
ter II and shown in Fig. 2.3. The cascaded clustering is obtained from 
using the BCA again to cluster the local cluster means. This cascaded 
clustering is nonadaptive and uses the Euclidean distance between 
cluster means for a simple and crude measure of intercluster distance. 
Feature map encoding consists of the entropy encoder defined in Chap- 
ter IV. Let n be the number of elements per local source, the num- 
ber of local sources per sequence, m the number of sequence clusters, 

s 

and f the number of bits used to define each sequence cluster. The 

sequence spectral data rate R(seq) in bpppb for transmitting the 

spec 

sequence cluster definition is given by 


R(seq) = 
spec 


m »f 
s 


K ’n'd 
s 


(5.4) 
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Let mo be the number of local clusters per local source, and m^ the 

36 s 

number of sequence clusters present In the b-th local source. The 

average local spectral data rate R(loc) for defining how many and, 

- - spec - ' 

which sequence clusters are present per local source is given by 

flog^ fiog2 %~\ ^ b 

R(loc) = 1 + 777 ;^ 7 » 

spec n*d‘N^ s 


(5.5) 


b=i 


where 



(5.6) 


is the average number of sequence clusters per local source for the 
local sources in the image. Let d be the number of spectral bands and 
CR*^ be the feature map encoding compression ratio for the b-th local 
source. Then the average encoded spatial data rate for the feature map 
of subset labels is 



c 

spat 


r hi 
1 V 1 I 3 l°8z “s I 
3 cr‘ 

b=l 


(5.7) 


The total rate in bpppb for the Cascaded CCA is 

R(casc) = R(seq) + R(loc) + R^ , (5.8) 

tot spec spec spat 


or 


R(casc) 

tot 


°s‘f , ri°s 2 • °‘s 

N *n»d n*d n*d 

s 


1 To , 

1 Y' 3 I ^ I 

^ CR^ 

b^l 


(5.9) 
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5. Impact on User Data Processing 

The following is an ejiample of use of the Cascaded CCA. Assume 
% 

16 local sources (Ng=*16) of 16 ^ 16 elements (n='256) are clustered 
to 8 local clusters (mj 2 ^= 8 ) each and then the clusters are clustered 
to 16 sequence clusters (mg=16) . Assume each sequence cluster defi- 
nition consists of 7 bits to define within 1 percent the number of ele- 
ments per cluster, 32 bits to define the cluster mean in 4 bands 
(d=»4) , and 16 bits to define the cluster variance in 4 bands. Each 
cluster definition thus requires 55 bits (f=55) . The sequence spec- 
tral definition data rate from (5.4) is R(seq) = .0537 bpppb, A 

spec 

typical average number of sequence clusters present per local source 

would be about 5.5 (m “5.5). Then the local spectral definition 

s 

(R(loc)) in (5.5) is .0244 bpppb. There are 4 spectral bands and 
spec 

assume the average coded spatial data rate per band in (5.7) for the 

average of 5.5 subset labels is R =.5 bpppb. This corresponds to 

spat 

an average entropy encoder spatial data rate reduction of .81. Thus 
the total Cascaded CCA data rate is .578 bpppb. If the entropy coder 
was not used the total data rate would be about .7 bpppb. 

In some data systems (e.g. , Landsat) it is not sufficient' to 
provide data compression as a means for obtaining an approximate image 
with the transmission of fewer bits. In fact it can be more important 
to make the data compression provide a form of preprocessing which 
facilitates user data selection and user data interpretation. An accu- 
rate measure of how much any algorithm benefits user data processing 
can only come through interactive assessment with specific data users. 
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\ 

However, here we attempt to exemplify potential user processing bene- 
fits of the Cascaded CCA concept through continuation of the above 
example. , . ... ' ' 

Figure 5.8 (a) - (d) shows block diagrams of how the outputs of 
the Cascaded CCA might be used to aid user data selection and user data 
interpretation. Assume first that the user needs to know what of inter- 
est is there, but does not need to determine accurately where. Such 
information is useful for user data selection to quickly eliminate 
large sequences of data with nothing of interest. Also this informa- 
tion may be all that is needed, as in inventory assessments (e.g., 
determination of the percentage of crop t 3 ?pes) . The user can obtain a 
quick overview of the data by observing only the sequence spectral 
definition of the Cascaded CCA. In the above example, the sequence 
spectral definition provides a representation of what and how much is 
present in a sequence of 4096 picture elements in terms of 16 sequence 
cluster definitions. As shown in Fig, 5,8 (a), the user provides a 
classifier to identify which clusters represent data belonging to a 

class of interest. The use of R(seq) to identify which classes of 

spec 

interest are present per sequence of 4096 picture elements required a 
data rate of only ,0537 bpppb, and the user classification of only 
16 features. This reduces the data volume the user must handle by a 
factor of 112, and also reduces the number of user classifications to 
be made by a factor of 256. 

Asstnne the user has selected a sequence of 4096 picture elements 
which has one or more classes of interest, and the user now desires to 
obtain a better spatial definition for these classes. Let the class 
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Fig. 5.8. Block Diagram of User Data Selection and Interpretation for the Cascaded CCA. 

(a) Classes per sequence, (b) Classes per local source, (c) Class per 
vector, (d) Image approximation. 
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labels assigned to the 16 sequence clusters be stored in a 16 word 
look-up table. The local cluster definition is a list of which sequence 
clusters are present per-16-x 16 local source in the sequence. Fig- 
ure 5.8 (b) shows the use of the local spectral definition as an 
address to the look-up table with the class labels. The output is the 
identification of classes present per each 16 x 16 image elements, thus 
permitting a finer level of data selection within a larger sequence of 
image elements known to have classes of interest. Even when this finer 

J 

level (16 X 16 image elements) of data selection is used, the data rate 
is only .078 bpppb. Thus the data volume is reduced by 77, and the 
number of classifications still reduced by 256. 

In some applications the identification of what is present within 
4096, or 256 image element sets would be adequate. In other applica- 
tions the user may desire to know where the classes of interest occur 
on an image element by image element basis. Refer to Fig. 5.8(c). 

The subset of^class labels for each 16 x 16 local source are stored in 
an 8-word look-up table. The spatial data definition given by the fea- 
ture map is used as an address to define the class label appropriate 
for each image element. If feature map encoding is used then a 
decoding is required before obtaining the address. This additional 
detailed spatial information increases the data rate to .578 bppb. 

The data volume is reduced by 10.4, but still the nianber of classifi- 
cations is reduced by 256. 

Finally, assume the user desires instead a reconstructed image for 
manual interpretation. This application of the data is shown in 
Fig. 5.8(d). The 16 sequence clusters defined by the sequence cluster 
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definition data are stored in a 16-word look— up table. The local clus- 
ter definition data is used as an address to output the subset of 
sequence clusters which are present for each local source. The subset 
of sequence clusters are stored in an 8-word look-up table. Then the 
spatial data definition is used as addresses to output an approximate 
reconstruction for each picture element. The approximation typically 
would be the cluster mean, although the cluster variance per band could 
also be used in reconstructing an image. The data rate required for 
this image reconstruction is also .578 bpppb, representing a data 
reduction of 10.4. 

Observe that a combination of the functional elements in 
Figs. 5.8(c) and 5.8(d) give the capability for acquiring all or any 
subset of the user outputs simultaneously. Figure 5.8 also demon- 
strates that a simple look-up table architecture can be used for 
decoding. The above examples conceptually demonstrate how the 
Cascaded CCA benefits ground data handling by providing for data selec- 
tion and reduced interpretive computation. Some computer simulation 
results of the Cascaded CCA are given in the next chapter. Further 
research is needed to better assess the practical benefits and to 
Investigate other forms of this concept. 
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CHAPTER VI 

PERFORMAIJCE OF VARIOUS GCA CONFIGURATIONS 

The performance of- a d'a'ta compression algorithm is measured in 
terms of the data quality vs. the data rate. Data quality is most com- 
monly measured in terms of percent mean square error (%MSE), classifica- 
tion accuracy, and subjective appearance. 

Subjective appearance is an appropriate criteria when manvial photo- 
graphic interpretation is used. In such cases it is often too difficult 
to define a mathematical eicpression for adequately quantifying data 
quality. The obvioxis disadvantage of this criteria is that it is sub- 
jective rather than quantitative. 

Classification accuracy is an important quantitative measure of 
data quality in applications where the compressed data is automatically 
interpreted Ce.g., Landsat data applications). However, classification 
accuracy is dependent on the classification technique, as well as the 
data compression technique, and often significant accuracy improvements 
can be obtained by tailoring the classification technique to the speci- 
' fic data compression technique. Therefore, it is important to jointly 
investigate compression and classification when the data quality is 
based on classification accuracy. 

%MSE also provides a quantitative measure of data quality. This 
criteria is widely used because of its convenient mathematical form. 

The /31SE is not generally meaningful for measuring data quality in an 
absolute sense. For example, it would not usually be tiseful for com- 
paring data quality across different compression techniques, or across 
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different test images. However, the %MSE can be very useful for com- 
paring performance from options of the same compression technique and 
the same test image. 

In this chapter computer simulations are used to provide compara- 
tive performance results for the various CCA configurations. These 
performance con^arisons can be helpful in assessing the impact of data 
compression on the data user, and also for defining performance vs. 
implementation complexity trade-offs. In terms of implementation it is 
desirable to keep m, n, and the nxnnber of clustering iterations small 
in order to lower memory and computation requirements. It is also 
easier to Implement sin 5 >ler distance measures and nonadaptive config- 
urations of the CCA. However, the impact on performance of these param- 
eters must also be determined. Below, the impact on performance of 
various CCA configurations is given in terms of %MSE, subjective 
appearance, and classification accuracy. 

A. %MSE Performance Results 

/N 

Let X represent the original d-dimensional image data, and let X 
represent the reconstructed approximation of the same image data after 
compression. Then the MSE between the original image and the approx- 
imate image is the expected value of the square of the Euclidean 
distance between X and X, or 

MSE = E 1^1 i X - X ( l^J . (6.1) 

The average energy, 6 in the original image is defined in terms of the 

2 

variance in the i-th spectral band, i=l, 2, ...,dby 
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s= E 


i=l 


Thus the %MSE is defined by 
%MSE = 


( 6 . 2 ) 


(6.3) 


I . Uncoded CCA 

Various configurations of the CCA were simulated on the test image 
shown in Fig. 6.8(a). This test image is a 256 x 256 picture element 
subset from a 4-band Landsat image of the Verde Valley in Arizona. The 
purpose of these simulations is to obtain a measure of %14SE vs. total 
data rate as a function of the (n, m, CR) combinations. The first 
simulations used the Uncoded CCA of Fig. 3.1. This corresponds to 
determining performance as a function of (n,m) pairs for CR ® 1. The 
performance of this Uncoded CCA in terms of %MSE vs. for various 

values of n is plotted in Fig. 6.1. For each n shown, all square sub- 
sets of n picture elements in the test image were cong>ressed using 
various values of m. For exatq>le, for n =» 36 the test image was 
processed four times as follows: two features (m => 2) per each subset 

of 6 X 6 (n = 36) picture elements, three features (m = 3) per 6x6, 
four features (m =» 4) per 6x6, and five features (m => 5) per 6x6. 
These four simulation results were then plotted in Fig, 6.1 and a smooth 
curve was drawn between them. The cluster feature used for this image 
appro x i m ation application was only the cluster mean, and each cluster 
mean was defined by either 6 or 8 bits per band (f = 24 or 32) . There- 
fore, the image approximation consisted of substituting the closest 
cluster mean for each piccure element. For every simulation with a 
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Fig. 6.1. %MSE vs. fo.r Various Values of n for 

the Uncoded CCA. 

given n and m, the %MSE between the original and reconstructed image was 
calculated from (6.1) through (6.3) and calculated from (3.11). 

For n = 576 and n = 1024 the performance curves were essentially 
Identical. The results in Fig. 6.1 show that for the selected source 
data the Uncoded CCA peaks in performance for all at n approxi- 

mately equal to 256. Similar results have also been obtained from 
simulations with other test images. From (3.11) one can see that for 

constant R the spectral data definition decreases as n increases, 
tot 

and the spatial data definition increases. The simulations suggest 
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that as n increases above 256, the data quality (%MSE) is decreased 
more due to poorer spectral definition than it is improved due to 
better spatial defii^tlon. 

2. Coded CCA 

The same simulations that were conducted for the Uncoded CCA 

(CR^l) were also conducted with entropy coding of the feature map. 

This simulation configuration is called the Coded CCA, and is shown in 

Fig. 4.7. The purpose of these simulations was to measure %1'ISE vs. data 

rate as a function of (m,n) pairs when CR is determined by the entropy 

coding of the feature map. The entropy encoding technique used is that 

which was described in Fig. 4.5, with the feature map representation of 

SS_ as described in Chapter IV. The simpler feature map representation 
L 

of could also be used with only slightly decreased performance. 

Li 

Entropy encoding basically reduces the spatial data rate without 
Increasing the %MSE. The reduction in spatial data rate for the feature 
map representations S_ and SS_ are plotted in Fig. 4.6. The total 

Lj Li 

coded data rate R*^ is defined in (4.20). All overhead data rate costs 
tot 

have been included in the entropy coding results. 

The Coded CCA simulations were conducted in the same manner as the 
Uncoded CCA simulations, resulting in the performance curves shown in 
Fig. 6.2. The spatial rate reduction, or CR due to entropy coding was 
shown in Fig, 4.6 to be greater for larger n. Thus the performance 
curves for the Coded CCA will shift more to the left of the Uncoded CCA 
curves when n is larger. This effect is observed by comparing the Coded 
CCA curves in Fig. 6.2 with the Uncoded CCA curves in Fig, 6.1. The 
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Fig. 6.2. %MSE vs. R§ot Various Values of n for 
the Coded CCA. 

Coded CCA continues to increase in performance as n increases, but it 
is near peak performance for n = 256. 

3. Adaptive CCA 

In the next simulations, the Adaptive Clustering Algorithm (ACA) 
which was discussed in Chapter II was used to adaptively determine the 
number of clusters to use for each sequence of n picture elements. 

The Adaptive CCA configuration is the same as the Coded CCA in 
Fig. 4.7, except the clustering is adaptive. The total data rate B (A) 
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is defined in (5.3)- The advantage of the Adaptive CCA is that a lower 

average value of m can typically be used to obtain similar data -quality. 

The performance and rate of the Adaptive CCA are now dependent on the 

criteria for increasing or decreasing the number of clusters, as well as 

the combination of (n, m, CR) . The performance of the Adaptive CCA was 

observed to be less dependent on the size of n, with all results for n 

of 100 or greater nearly on the same performance curve. Although 
c 

R (A) was sensitive to changes in the criteria for determining the 
tot 

number of clusters used per sequence, the %MSE vs. rate performance was 
very insensitive to changes in that criteria. Tliere is a significant 
performance gain in the Adaptive CCA relative to the Coded and Uncoded 
CCA, This gain is shown in curves 3, 4, and 5 of Fig. 6,3, However, 
the performance gain due to adaptability is dependent on both the image 
data and user application. 

4. Performance Comparisons 

Tests were performed to compare the %MSE performance of the CCA 
with some other compression techniques which are well known. The two 
techniques used for comparison are the well-known Hadamard and Fourier 
transform techniques. A discussion of Hadamard and Fourier transforms 
in image coding is given in [6] through [S], Both techniques xiere used 
in an adaptive form, and the compression was done on each band of the 
test image independently. Some Improvement in performance could be 
obtained by modifying the technique to work jointly on all four bands. 
The Hadamard transform technique involved taking the two dimensional 
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Fig. 6.3 %MSE vs.. Rtot Var^.ous CCA Configurations and for 
Adaptive Hadamard and Fourier Techniques. 


Hadamard transform of all 8x8 subsets of the image. Then the 64 coef- 
ficients were placed in 9 zones, each of which had the quantization and 
corresponding bit rate per coefficient based on the variance, or energy 
of the coefficients in the zone. The Fourier transform technique was 
similarly adaptive, but in addition a symmetrical transform approach 
was used which doubly folded each data subset to provide horizontal and 
vertical symmetry. This symmetry reduces the intensity discontinuities 
at the boundaries due to the low pass filtering. The %MSE vs. total 
data rate for these two adaptive transform techniques is shown in 
curves 1 and 2 of Fig. 6.3 along with the CCA performance curves. 
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In. Fig. 6.4 a summary of the above simulation results is given by 
plotting required for 2% USE vs. n for the various compression con- 

figurations. The data points plotted in .Eig. -6v4 were interpreted from 
the smooth curves in the previous figures, as well as from other simula- 
tion results. These curves point out a substantial %MSE penalty for 
n<100 and that a good value for n would be 256. They also show a per- 
formance gain of about 20 percent due to entropy coding, and about 
another 20 percent due to being adaptive. The %MSE vs. performance 

results agree well with the subjective measure of image appearance for 



Fig. 6.4. Rtot ^ Various CCA Configurations 

and for Adaptive Hadamard and Fourier Techniques. 
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the three CCA forms simulated above. This would be expected since the 
Uncoded, Coded and Adaptive CCA all introduce the same type of image 
degradation. In particular the coding used in the Coded CCA changes the 
spatial data rate, but does not change the image quality. 

5. Cascaded CCA and Feature Map Subsampling 

A simple form of the Cascaded CCA was defined in Chapter V, and a 
diagram is shown in Fig. 5.1. Points A and B in Fig. 6,5 represent 
performance results from two simulations of the Cascaded CCA on the test 
image of Fig. 6.8(a), Point A results from clustering a sequence of 
16 16 X 16 local sources into 8 clusters each, and then clustering the 
128 local clusters into 16 clusters per sequence and an average of 5.44 
subset clusters per 16 x 16 local source. Point B results from cluster- 
ing a sequence of 16 16 x 16 local sources into 4 clusters each, and 

then clustering the 64 local clusters into 16 clusters per sequence and 

c 

an average of 3.37 subset clusters per 16 x 16 local source. R (case) 
tot 

is determined from (5.8) with CR**1 to account for not using entropy 
encoding in the following comparisons. Curve 2 is a performance curve 
for the Cascaded CCA (Uncoded) inq>lied by the simulation results of 
points A and B. Curve 1 is the performance curve for the Uncoded CCA 
with n=256 which was presented previously in Fig, 6.1. The Cascaded 
CCA is compared with the Uncoded CCA such that both CCA configurations 
are uncoded and nonadaptive. These limited simulations showed the 
uncoded Cascaded CCA as having a performance somewhat less than 

the Uncoded CCA. This unexpected result might be due to the crude 
intercluster distance measure used for this specific cascaded cluster- 
ing, or perhaps due to a poor choice of user supervision parameters. 
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For point A, for exaisple, the spectral data definition requires a data 
rate of only .047 bpppb out of the total data rate of .665 bpppb. 

Further investigation and simulations of the. Cascaded CCA are- needed to 
adequately define its performance characteristics. 

At this time the results of a simulation are presented which 
incorporates the use of feature map subsampling in the above Cascaded 
CCA examples. Feature map subsai!q)ling was discussed in Chapter IV as a 
simple means of reducing the spatial data rate by subsampling feature 
map elements for transmission, and then using linear interpolation for 
approximate reconstruction at the destination. The above Cascaded 
CCA examples are used to demonstrate this alternate feature map data 
reduction, because of the very high ratio of spatial data definition to 
spectral data definition. The spectral data definition is unaffected 
by the feature map subsan^ling. Curve 3 is a performance curve drawn 
between simulation results obtained from augmenting the Cascaded CCA 
simulation cf point A with subsampling by 2, and subsampling by 4. Per- 
formance Curve 4 is obtained by similarly augmenting the Cascaded CCA 
simulation of point B. The total data rate in each case is obtained 
directly from the cascaded total data rate by simply reducing the 
spatial data rate by the subsampling factor. Observe that the slopes of 
these performance curves are more favorable, even though subsan^ling is 
a very crude means of data reduction. This result appears to Indicate 
the importance of properly distributing the total data rate between the 
spectral and spatial portions. The meaningfulness of using %MSE as a 
data quality measure is less clear in the example of Fig. 6.5,^^ecause 
of the different types of data degradations being compared. However, 
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Fig. 6.5. %MSE vs. R-tot the. Cascaded CCA With CR = 1 
and With Feature Map Subsampling. 

%MSE appears to provide a meaningful performance criteria for comparing 
similar data degradations resulting from the various CCA simulations 
shown in Figs. 6.1 through 6.4. 


6. Convergence Rate of Iterative Clustering 

The key element of the clustering approaches used in this work 
is the iterative clustering which makes up the Basic Clustering 
Algorithm (BCA) shown in Fig. 2.3. Each iteration of the BCA consists 
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of an assignment of all the vectors to the nearest mode center, followed 
by the recalculation of the mode center as the average of all vectors 
assigned to it. Convergence- of the BCA oitcufs when none of the vectors 
change assignment, since then the mode centers also do not change. In 
the above %MSE performance simulations the clustering was allowed to 
iterate until convergence. However, since the clustering computation is 
linearly related to the number of Iterations, it is necessary to 
investigate the relationship between clustering performance and the 
number of iterations. 

Figure 6,6 shows the average number of iterations required' for con- 
vergence as a function of m and n from simulations on the test image of 
Fig, 6.8(a). As expected. Fig. 6.6 shows the number of iterations 
increase with increasing m and n. Of course the number of iterations 



Fig. 6.6. Average Number of Iterations Required for BCA 
Convergence as a Function of n. 
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for convergence will fluctuate about the average dependent on the 
specific data set. Since it is much more difficult to design hardware 
to handle a varying number of clustering iterations, it is desirable to 
constrain the number of iterations to some maximum number. Intuitively 
one would anticipate that many of the last iterations before convergence 
are just reassigning a small number of vectors which are located nearly 
equidistant from more then one mode center. The final assignment of 
such vectors will likely have little impact on the clustering perform- 
ance. Simulations were conducted to determine the impact on %MSE per- 
formance due to constraining the ntanber of clustering iterations. The 
results of these simulations are shown in Fig. 6.7 where the %MSE is 
plotted as a function of the maximum number of iterations allowed for 
various values of m and for n=256/ ' These results indicate that with 
respect to %MSE performance the number of clvister iterations *could be 
limited to a maximum of about 4 iterations with little performance 
loss. This is a significant result in terms of practical implementation 
considerations . 

7* Distance Measure Impact on Performance 

The primary function of the Basic Clustering Algorithm (BCA) is the 
assignment of vectors to the nearest mode center. As discussed in 
Chapter II, there are many different measures of the distance between a 
vector and a cluster. A more general measure might use the sample 
covariance of the cluster in addition to the sample mean to account for 
a possibly nonsymmetric distribution [24], [25]. In order to simplify 
implementation, the BCA was assumed to use either Euclidean distance 
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Fig. _6.7_. %MSE vs. Maximum Number of Iterations Allowed 
in BCA for Various Values of m and n = 256. 

measure as in (2.1), or the absolute value distance measure as in (2.2). 
The absolute value is a very crude distance measure compared to the 
Euclidean distance in hi^er dimensional space (e.g., four dimensional 
spectral intensity space). Thtis an indication of the sensitivity of the 
clustering performance to the choice of distance measure can be obtained 
by comparing results from the use of these two distance measures. 

Table VI compares the simulation results from using Euclidean 
and absolute value distance measures in the Coded CCA. The simulations 
for Table VI used 16 x 16 element local sources of the four dimensional 
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TABLE VI 

COMPARISON OF EUCLIDEAN AND ABSOLUTE VALUE DISTANCE MEASURES 


Coded CCA 
n ■ 256 
d - 4 
f - 24 

Euclidean Absolute Value 



m 

tot 

%MSE 

■'tot 

%MSE 


2 

.236 

10.43 

.238 

10.45 


5 

.562 

2.63 

.567 

2.70 


8 

.777 

1.69 

.779 

1.77 


16 

1.19 

1.036 

1.19 

1.134 

test image in Fig. 

6.8(a). 

Simulations were 

conducted 

to obtain 2, 5, 

8, and 16 clusters 

per local source 

, and each 

cluster centroid was 


defined by 24 bits. The results show that the absolute value distance 
measure degrades the %MSE performance by about .02 %MSE at low total- 
data rates, and the degradation Increases to about .1 %MSE at high total 
data rates. One would expect the quality of the distance measure to be 
more important at higher total data rates, since the Increased number of 
clusters make the assignment function more dependent on the distance 
measure. However, even for the higher data rates, the %MSE degradation 
due to using the absolute value distance measure is very small. This 
indicates that the clustering performance is quite insensitive to the 
distance measure used in the assignment process. The results in 
Table VI are also typical of results obtained from simulations with 
other CCA configurations and other test images. 

B. Subjective Image Appearance Performance 

Figures 6.8 through 6.13 provide examples of reconstructed images 
from simulations of the various CCA foTnas. The image reconstructions 
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(a) 6 bpppb 




(c) .88 (.78) bpppb 


(d) .917 (.885) bpppb 


(*) Original test Image of 256 x 256 picture elements 
quantized to 6 bpppb. A subset of a Landsat 1 
image of the Verde Valley, Arizona: band 2 selected 
for display from a total of 4 bands. 


(b) 


(e) 

(d) 


16 X 16 local sources with 8 clusters each, 
n«256, m*8, f»24, d-4; 

"spec * "spat ■ •75(.S89), • .937(.776) bpppb. 

10 X 10 local sources with 5 clusters each, 
n»100, m*5, f*24, d»4; 

"spec * -3* «spat * -see. 48), - .88(.78) bpppb. 

6x6 local sources with 3 clusters each, 
n*36, m*3, f«24, d»4; 

"spec * "spat * -417(.385), R^^^^ • .917(.885) bpppb. 


Fig. 6.8. Examples of the Uncoded (Coded) CCA 
for Different Values of m and n. 
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(a) 1.03 (.921) bpppb 



(b) .74 (.67) bpppb 



(c) .597 (.516) bpppb 



(d) .37 (.327) bpppb 




(a) 6 clusters per local source, 
n*100, m«6, f«24, d-4; 

“spec ■ -3®' “spet * ■ 1.03(.«1) bpppb. 

(b) 4 clusters per local source, 
n«100, m»4, f*24, d»4; 

%pec • "spat • "tot • ^'PPP*'- 

(c) 3 clusters per local source, 
n*100, m*3, f*24, d»4; 

“spec ■ “spat * R^^^ • .597(.516) bpppb. 

(d) 2 clusters per local source, 
n«100, m*2, f»24, d»4; 

“spec • -’2. Rjp,t • .25{.207). - .37(.327) bpppb. 


Fig. 6.9 Examples of the Uncoded (Coded) CCA for a varying 
number of clusters in a constant local source 
size of 10 X 10 elements . 
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(a) 8 bpppb 



(c) 1.18 (1.04) bpppb 



(b) 1.18 (1.04) bpppb 



(d) 1.18 (1.04) bpppb 


(a) Original girl test Image of 2S6 x 256 picture elements 
quantized to 8 bpppb. Red band selected for display 
from a total of 3 bands. 


(b) Euclidean distance measure. 
n-100, m«5, f"24, d»3; 

Spec * Spat • -777(.639). R^^^ ■ 1.18(1.04) bpppb. 

(c) Absolute valued distance measure* 
n*100* m»5, f-24, d*3; 

Spec ■ Spat ’ Rtot * 1-18(1.04) bpppb. 

(d) Maximun of 4 clustering iterations, 
n»100, m*5, f*24, d*3; 

Spec • Spat ■ -777(.641). R^^^ • 1.18(1.04) bpppb. 


Fig. 6.10. Examples of the Uncoded (Coded) CCA with 10 xlO element 
local sources and 5 clusters each, but for different 
distance measures and for a limitation on the number 
of clustering iterations. 
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(b) 1.16 (1.05) bpppb 



(a) 10 X 10 local sources with an average of 4.0 clusters each, 
n»100, average ni*4.0, f»24, d*4; 

^pec • ^pat • •^S^(-374), -.735(.623) bpppb. 

(b) 10 X 10 local sources with an average of 5.6 clusters each, 
n*100, average m»5.6, f»24, d*3; 

%pec * %pat * *70(.59), « 1.16(1.05) bpppb. 

(c) 16 X 16 local sources with an average of 5.47 clusters each, 
n«256. average m*5.47, f»24, d*3; 

%pec • ^pat * *72(.51), R^^^ * .896(.686) bpppb. 

(d) The same image as in (c), except the feature map is subsampled 
by 2 and reconstructed through linear interpolation. Entropy 
coding is not used. 

%pec * "spat ■ -3S- "tot * '”6 bpppb. 


Fig. 6.11. Examples of the Adaptive CCA, including variations 
of no coding, entropy coding, and subsampling of 
the feature map. 
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(b) .356 bpppb 



(c) 


.583 (.403) bpppb (d) .183 bpppb 

(a) 16 X 16 local sources are initially clustered to 8 
clusters each. The Cascaded clustering combines 128 
local clusters from 16 local sources Into 16 sequence 
clusters* and an average of 5.44 sequence clusters 
are used per local source. 

n«256, m-*8. f*24, d*4, N5*16, ni-»16* ins*5.44; 

R(seq)-.023, R(loc)«.024, R *.617(.453), R -.664(.50) bpppb. 
spec spec spat tot 

(b) The same Image as (a), except the feature map Is subsampled by 2. 

R(seq)«.023* R(loc)-.024, R-.^a.-.309* R_.*.356 bpppb. 
spec spec spat tot 

(c) 16 X 16 local sources are Initially clustered to 4 clusters 
each. The cascaded clustering combines the 64 local clusters 
from 16 local sources Into 16 sequence clusters, and an 
average of 3.16 sequence clusters are used per local source. 
n«256, mt-4, f-24, d-3, N5-I6, ms-16, ii1s«3.16; 

R(seq)-*031, R(loc)-.019, %p3^*.533(.403), 583 (.403) bpppb. 


(d) The same Image as (c), except the feature map Is subsampled by 4. 

R(seq)-.031, R(loc) - .019, R -.133, R -.183 bpppb. 
spec spec spat tot 

Fig, 6.12. Examples of the Cascaded CCA, including variations 
of no coding, entropy coding, and subsampling of 
the feature map. 
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(a) 6 bpppb 



(b) 8 bpppb 



(c) .937 (.776) bpppb 



(d) 1.25 (1.01) bpppb 


TKODUCIBILITY of the 

aoWAL PAGE IS POOE <*' 


False color composite of Verde Valley test image. 
First three multi spectral scanner bands selected 
for display in blue, green, and red respectively. 


(b) Color composite of girl test image. 


(c) 


16 X 16 local sources with 8 clusters each, 
n»256, m*8, f*24, d*4; 


spec 


.187, R, 


spat 


.75(.589), R 


tot 


.937(.776) bpppb. 


(d) 


16 X 16 local sources with 8 clusters each, 
n*256, m*8, f*24, d*3; 

R « .25, R_,„ * 1.0(.764), R^^^ = 1.25(1.01) bpppb. 


spec 


spat 


Fig, 6.13. Examples of color composites for the 
Uncoded (Coded) CCA. 
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consisted simply of substituting for each picture element the centroid 
value of the cluster to which the picture element was assigned. Other 
cluster parameters such as the variance per band could also be useful 
for image reconstruction in special cases. In all the CCA simulations, 
all the bands of the test image are jointly processed, and the band 
with the highest signal energy is selected for display in black and 
white. Two different types of image data shown in Fig. 6.8(a) and 
Fig. 6.10(a) were selected to demonstrate typical performance in 
terms of subjective image appearance. One example of color composites 
of these test images are also included in Figs. 6.13(a) and (b) . Simu- 
lations were conducted on many other test images , and the CCA perform- 
ance was found to be very similar for wide variations in the type of 
test data. 

Figures 6.8(b) through (d) demonstrate reconstructed images which 
have similar total data rates, but widely varying local source sizes 
and number of clusters per local source. Note that the use of entropy 
coding on the feature map lowers the spatial data rate, but does not 
cause any changes to the reconstructed image. Therefore, the Uncoded 
CCA and the Coded CCA have the same resulting reconstructed image, but 
different total data rates. Throughout the image examples, the total 
data rate corresponding to use of entropy coding on the feature map is 
given in parentheses. Figure 6.8 indicates considerable improvement in 
using local sources larger than 6x6 elements, but the differences 
between local sources larger than 10 x 10 elements appear small. In 
addition, the result for a 6 x 6 element local source does not benefit 
as well from entropy coding. 
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example in. Fig. 6.11(c). The feature map is subsampled by a factor of 
2, and then reconstructed by using linear interpolation. Thus the 
spatial data rate is reduced by a factor of 2, and the iniage appjaarance 
degrades mainly due to smoothing of sharp edges in the image. 

Images resulting from simulations of the Cascaded CCA are shown in 
Fig. 6.12. Figure 6.12(a) resulted from clustering 16 16 x 16 local 
sources into 8 clusters each, and then clustering the 128 local clusters 
‘into 16 sequence clusters. The average number of sequence clusters used 
per local source is 5.44. In Fig. 6.12(c), a sequence of 16 16 x 16 
local sources are clustered to 4 clusters each, and then the 64 local 
clusters are clustered into 16 sequence clusters. For this image the 
average number of sequence clusters used per local source is 3.16. In 
both Fig. 6.12Ca) and Cc) the spectral data rates are very low. This 
restilts in heavy contouring within the image which significantly 
decreases image appearance performance. However, this CCA configuration 
is more so intended fox automatic interpretation applications. In 
such applications it is often desirable to sharply define boundaries 
which separate a relatively limited number of classes with different 
spectral signatures, but it is not Important to preserve the intraclass 
scatter information within the boundaries. Furthermore, if the spatial 
boundary information is not in 5 >ortant, the spectral signature informa- 
tion could be obtained for only the very low spectral data rates. The 
loss of intraclass scatter information is roughly equivalent to contour- 
ing within the image. Generally, contouring degradation of the data 
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impacts image appearance performance more so than it would impact auto- 
matic interpretation performance. However, the image appearance per- 
formance is a subjective measure which greatly depends on the specific 
use of the image data. 

Figure 6.12, (b) and (d) , represent the use of subsampling on the 
feature maps from the Cascaded CCA results of Fig. 6.12, (a) and (c), 
respectively. In Fig, 6.12(b) the feature map was subsampled by a 
factor of 2 and the spatial data rate reduced by a factor of 2. In 
Fig. 6.12(d) the feature map was subsampled by a factor of 4 to reduce 
the spatial data rate by a factor of 4, The impact of feature map sub- 
sampling is mainly the poorer definition of sharp transitions in the 
image. The overall appearance quality of these two images is quite low. 
However, the total data rate is also quite low. For the image in 

Fig. 6.12(d), the total data rate is reduced by a factor of about 44. 

\ 

In Fig. 6.13(a) and (b) are shown color composites of the two test 
images. Fig. 6.13(a) is a false color composite which displays the 
first three bands of the Landsat I multispectral scanner image in 
blue, green, and red respectively. Figures 6.13(c) and (d) show 
reconstructed color composites from simulations of the Uncoded and 
Coded CCA. Each simulation uses 16 x 16 local sources with 8 clusters 
each. The image appearance performance of the CCA is usually better 
when judged in terms of color composites instead of black and white 
images. This apparently results because the clustering algorithms 
are attempting to obtain a clustering representation which minimizes 
the average error in a multidimensional sense, rather than trying to 
minimize the average error in each band of the image independently. 
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C. Classification Performance 

Classification performance is specified in terms of the classifica- 
tion accuracy for a given data rate. Classification accuracy is 
dependent on both the classification technique and the data compression 
technique. Furthermore, it is also possible that the performance with 
a given classification technique may be significantly improved through 
adaptations which account for the changes in the data catised by a given 
data compression technique. There are of course many different 
classification approaches (e.g., parametric, nonparametric, supervised, 
nonsupervised) . Similarly the classification might be on a per vector 
basis, or alternately include texture. A very common classification 
approach in Landsat applications as well as elsewhere is the Gaussian 
parametric classifier on a per vector basis. This classifier is often 
used because of its relatively simple implementation. In this disser- 
tation, the Gaussian parametric classifier is selected for investigating 
the classification performance of the CCA. In addition, this example 
demonstrates the interaction between data classification and data 
compression. 

1. The Gaussian Parametric Classifier 

A model for statistical classification is shown in Figure 6.14, 

Let be the vector in measurement space and assume there are q classes 
{w^}, j»l,2, . . ,qin pattern space each characterized by a con- 
ditional density function p(X/Wj), and a prior probability P(w^), If 
the classifier G(*) is picked to minimize probability of error, then the 
solution for G(») is to maximize the a posteriori density function, or 
equivalently to choose 
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Xew. iff p(X/w.) > p(X/w, ) p(w,) V k j*. ^ (6.4) 

The classifier defined in (6.4) is commonly referred to as the Maximum 
Likelihood Classifier. The classifier solution for minimizing a more 
generalised average cost criteria is given in Cl2, p« 9]. 

The classifier in (6.4) is in terms of unknown density functions 
which need to be estimated. In parametric classification a functional 
form for the density functions is assumed, and only a set of parameters 
need to be estimated in order to completely define the density func- 
tions. A functional form which results in a classifier that is rela- 
tively easy to implement is the Multivariate Gaussian, or Normal 
distribution. For this parametric assumption, each conditional 
density function is entirely defined by a mean vector and a covariance 


matrix. Assume the measurement vector X which is generated by class 


Wj is normally distributed with mean and covariance [$j]. Also, 
assume the covariance matrix has an inverse Then the d-variate 

conditional distribution function of X is 


° - Uj) j , (6.5) 

where |$j| is the determinant of Note that the classifier in 

(6.4) is unaffected by taking the log of each side of the Inequality 


'Ties may be arbitrarily decided. 
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since the log is a monotonic function. Define a set of discriminant 
functions (X) by 

g^(X) =* log {p(X/w^) P(w^)} . (6.6) 

Then from (6.4) the classifier is also defined by 

Xew. iff g.(X) > g^(X) V j . (6.7) 

By using the parametric assumption of (6.5), the discriminant function 
of (6.6) becomes 

SjCX) =-f log ZTT-jlog I^ji'*|(X-Uj)^ * 

( 6 . 8 ) 

Now assume the a priori class probabilities are equally likely, or 

P(w.) = P(w„) = . . . = P(w ) = . (6.9) 

l z q q 

Since constants can be removed from both sides of (6.7), the discrimi- 
nant function in (6.8) can be reduced to 

Sj(X) --^iog |$^1 -|-(X-U^)^ ‘ 

Thus, the Gaussian parametric classifier is defined by (6.7) and (6.10), 
where only the mean and covariance of the conditional distri- 

bution function in (6.5) must be estimated. 

The mean and covariance of the measurement vector X conditioned on 
class Wj must usually be estimated from samples of X which are 
Independently known to have resulted from class w^ . Such a set of 

measurement vectors are referred to as a training set for class w., 

3 

and classification methods which require such training are referred 
to as supervised classification methods. Training a classifier can 
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sometimes be a difficult task, and at the same time the classification 
accuracy is very dependent on accurate training. Similarly, the use 
of data compression can greatly impact classification accuracy by 
affecting the training of the classifier. 

The following is an example of typical training of the Gaussian 

parametric classifier for Landsat applications. For Landsat w^* might 

correspond to a specific earth resource class with a well defined 

spectral reflectance. However, the class w, must be distinguished 

J 

from other classes based on the measurement X from the multispectral 
scanner (MSS) instrument. If the distribution of the MSS instrument 
output for class can be reasonably approximated by the normal 
distribution in (6.5), then determining the discriminant function 
in (6,10) for w^ reduces to estimating the mean and covariance of 
X resulting from the Landsat example, samples of X for w^* 

are obtained by identifying subsets within the image data which are 
known to represent w^ via independent ground truth information. 
Typically, a total training set for a given class consists of a small 
number of subsets (e.g., 4) of about 100 picture elements each, which 
are selected from various locations throughout the image. However, 
obtaining good ground truth information and registering the MSS image 
elements with the ground truth is often costly and time consuming. 
Furthermore, the measurement of spectral radiance X for a given class 

A 

w^ is also a function of sun angle, ground slope, ground moisture, 
atmospheric absorbtion, extraneous radiation, MSS instrument noise, 
etc. Thus, the mean and covariance of X obtained from a training set 
for class Wj is also a function of these unpredictable and time 
varying disturbance effects. Therefore, it is generally necessary to 
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repeatably retrain the classifier due to changes in time or location 
of the image data. 

Assume that a total of training set vectors have been obtained - 

from an image for class w, . Let be the m-th element of U., and let 

3 m — 3 

<J)^ be the element in the m-th row and n-th column of [$.j. Also, let 
tail j 

be the m-th element of the k-th measurement vector in the training 
set for class . Then the usual estimators for ^ and ] are given 
by 


m T mk 


( 6 . 11 ) 


k=l 


and 


4 " l- 2 (4 - “i) (4 ■ “n ) , 

^ k“l 


( 6 . 12 ) 


with 


3 “ 1>2, ...,q 
m ^9 ^9 . . . , d 
n — 1, 2, . . . , d. 

The classifier used for testing the classification performance of the 
CCA is the Gaussian parametric classifier defined by (6.7) and (6,10). 

The mean and covariance parameters for (6.10) are determined from 

i 

training sets as defined by (6.11) and (6.12). 


2. Performance from Training and Classifying Original Data 
The following assessment of classification performance for the CCA 
basically consists of finding the changes in classification accuracy of 
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the above classifier due to both training and classifying with CCA 
processed data rather than with original data. 

The test image selected for testing classification performance is 
a 4 spectral band, 256 line by 220 sample subset from the Flight Line 
C-1 image taken over Indiana on June 28, 1966. The original image con- 
sists of 12 spectral bands and 900 lines by 220 samples. The lines 
selected for the test image are lines 200 through 419. The spectral 
bands selected for the test image are bands 9, 10, 12, and 1, which 
correspond respectively to ,62-.66li» .66-.72y, .80-l.Oy, and .40-.44U* 
Band 12 is selected for display in Fig. 6.15(a), and a false color 
composite of the test image using bands 9, 10, and 12 is shown in 
Fig. 6. 16 (a) . Further detailed information on this test image can 
be obtained from [32]. 

A primary reason for using the Flight Line C-1 data is the avail- 
ability of reasonably reliable ground truth corresponding to as many as 
8 classes. Table VII lists eight classes of interest in the test image 
and the location and size of training sets for each class. These train- 
ing sets were carefully selected to avoid suspicious areas within the 
fields (e.g., apparent ditches). The mean and covariance of the 
training set samples for each class were determined from (6. 11) and 
(6.12), and all the picture elements of the test image were classified 
by the Gaussian parametric classifier defined in (6.7) and (6.10). If 
the maximum discriminant value in (6.10) for a picture element was 
below -70, the picture element was assumed to belong to none of the 
8 classes and was assigned a label of class 9. The results of training 
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TABLE VII 

TRAINING SETS FOR FLIGHT LINE C-1 TEST DATA 







Class 

Number 

All Training Sets are 
Class Name 

10 X 10 Elements 

Starting 

Line 

Starting 

Sample 

1 

Soybean I 

1 

135 

2 

Soybean II 

40 

130 



110 

60 



240 

10 

3 

Corn I 

1 

20 



26 

40 



120 

10 



210 

120 

4 

Com II 

190 

10 

5 

Corn III 

180 

40 



220 

10 

6 

Red Clover 

40 

10 



180 

70 



240 

140 

7 

Oats 

130 

130 



170 

140 



220 

60 

8 

Wheat 

90 

190 



100 

130 



150 

110 


and classifying with the original test image data is shown in 
Figs. 6.15(b) and 6.16(b), where every picture element is coded to 
identify its classification to one of 9 classes, and the relationship 
of class number, grey level, and color are given in Fig. 6.17. 
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iuu 


REPKODUCIBlUTi 

original page is poo 



(c) .625 (.429) bpppb (d) .625 (.429) bpppb, 

P = 3.4% 
e 


(a) The test Image Is a 4 band, 256 line by 220 sample subset from 
the 12 band, 900 line by 220 sample Flight Line C-1 image. The 
test image consists of bands 9, 10, 12, and 1, and lines 200 through 
419. Band 12 Is selected for display and the data rate Is 8 bpppb. 

(b) Grey-level coded classified linage obtained from training and clas- 
sifying on the original test data with the Gaussian parametric clas- 
sifier. The classification performance Is P ■ 7.3X for R,. ^ ■ 

8 bpppb. * 

(c) Uncoded (coded > CCA compressed image. 16 x 16 local sources with 
4 clusters each, n * 256, m » 4, f ■ 32, d » 4; 

^pec * R^q^-. 625(.429) bpppb. 

(d) Grey-level coded classified image obtained from training and 
classifying on the CCA compressed image with the Gaussian para- 
metric classifier (8=2). The classification performance Is 

Pg - 3.4% for « .625 (.429) bpppb. 

Fig. 6.15. Original, compressed, and classified images of the 
Flight Line C-1 test data. 
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(a) 8 bpppb 



(c) .625 (.429) bpppb 



(b) 8 bpppb, » 7.8% 



(d) .625 (.429) bpppb, 

P » 3.4% 
e 


(a) Color composite of original test image with bands 9, 10, and 12 
used for blue, green, and red respectively. The data rate Is 

8 bpppb. 

(b) Color coded classified image obtained from training and clas- 
sifying on the original test data with the Gaussian parametric 
classifier. The classification performance Is P * 7.8% for 

“tot * ® * 

(c) Uncoded (Coded) CCA compressed image. 16 x 16 local sources 
with 4 clusters each, n » 256, m * 4, f * 32, d = 4; 

“spec* -’25. «spat*-*^-^°^’* Rtot*-®25(-«9) bPPPb- 

(d) Color coded classified image obtained from training and clas- 

slfylt^ on the CCA compressed image with the Gaussian parametric 
classifier (8*2). The classification performance Is P • 3.4S 
for • .625 (.429) bpppb. ® 


Fig. 6.16. Color composite original, compressed, and classified 
images of the Flight Line C-1 test data. 
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REPRODUCIBILITY OF THE 
ORIGINAL PAGE IS POOR 

CLASS NUMBER GREY-LEVEL COLOR 

1 

2 

3 

4 

5 

6 

7 

8 
9 

Fig. 6.17. Relationship Between Class Number, Grey-Level, 
and Color for the Classified Images. 

A simple quantitative measure of classifier performance is the 

percent probability of error in classifying a picture element. An 

estimate of P can be obtained from the ratio of the number of incor- 
e 

rectly classified picture elements to the number of picture elements 
classified. A test set of picture elements is selected for each class 
for the purpose of estimating P^. These test sets basically consist of 
most of the picture elements for each class minus those elements for 
which the correct classification is uncertain. Those picture elements 
near borders of the field, or within regions of apparent ditches and 
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other anomalies are excluded from the test set. Such picture elements 

are likely the measured average reflectance over a spatial resolution 

cell which substantially consists of more than one of the classes 

(Including class 9), and therefore cannot be simply classified as 

correct or Incorrect. The location and size of the test sets for each 

class are given in Table VIII. The large size of the test sets are 

chosen to give a higher confidence level to the estimate for P , and 

e 

also to decrease statistical dependence between the training and test- 
ing procedure. A large test set is of even greater importance when 
estimating for classifying compressed data, because the correlation 
between reconstructed picture elements is often increased. This 
increased correlation tends to cause error events and no error events 
to occur in bursts Instead of In a more uniform spatial distribution; 
thereby necessitating a larger test set for the equivalent confidence 
level in estimating P^. The classifier accuracy is defined by P^, 
where 

p ^ 100 X Number of Incorrect classifications in test set 
e Number of elements in test set 

For training and classifying the original test image the classifier 
accuracy is P^ * 7.8%. The classification performance is given by 
P^ * 7.8% for a total data rate of 8 bpppb. 

3. Performance from Training on Original Data and Classifying- 
Compressed Data 

We now consider the impact on classification accuracy due to first 
compressing the data with the CCA. Any data compression algorithm can 
affect the classification accuracy in two ways, 1) by changing the 
conditional distributions of X, and 2) by changing the training data 
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used to estimate the conditional distributions of X. Consider first 
the impact of changing the conditional distributions of X. Uszaally 
one thinks of data compression as simply causing a spreading -of the 
conditional distributions, which results in an increased overlap of 
the distributions and an increased error rate. This is typically the 
case when the data compression approximates the data based on a 
criterion not closely related to minimizing the error probability of 
the classifier. However, better classification performance shovild be 
obtainable from a data compressor which specifically provides the 
classifier with information which aids accurate classification. 

Clearly, the best obtainable classification accuracy from the original 
data is also a bound to the classification accuracy which can be 
obtained from the compressed data. However, the actual classifier is 
usually not the best classifier due to practical memory and computation 
limitations in estimating the conditional distribution functions. 

For example, the Gaussian parametric classifier reduces computation 
by classifying each vector independently, while it is obvious a more 
complex classifier using surroimding vector information would have 
better classification accuracy. When the classifier is suboptimal 
(a most common occurrence) , the use of data compression can in fact 
result in an improvement in classification accuracy. The accuracy 
obtained with the original data is no longer a bound to the accuracy 
obtainable by the same classifier using the compressed data. With the 
CCA, for example, the data compression modifies each vector based on 
surrounding vector information, such that the simple vector-by-vector 
classifier is indirectly making decisions based on more information 
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than a single vector. Then the classification accuracy may increase, or 
decrease depending on how much useful new information the data com- 
pression provides the classifier, vs, how inaccurate the information 
is due to the data compression process. 

In Fig. 6.18 Curve 1 shows the classification performance of the 

Uncoded CCA when the Gaussian parametric classifier is able to train 

on original data. The test image of Fig. 6.15(a) was compressed at 

three different data rates by using 8, 4, and 2 clusters per each 

16 X 16 element local source. The classifier trained on the original 

data from the training sets defined in Table VII, and classified the 

CCA processed data by classifying the mean of each cluster. Each 

vector was given the class label of the cluster mean for the cluster 

to which the vector was assigned. The P was determined from (6.13) 

0 

and the test set of Table VIII. The classification performance for 
the same classifier and the original data is also shown in Fig. 6.18 
as the point at = 7.8% and = 8 bpppb. Clearly the classifica- 

tion performance substantially benefits from the grouping information 
provided by the CCA data compression. 

4. Performance from Training and Classifying on Compressed Data 

In many situations it may be very difficult for a remote sensor 
to identify and transmit properly selected subsets of original data 
from the image for training purposes. Then one must consider the 
additional classification accuracy degradation which may result because 
of inaccurate training from the compressed data. Curve 2 in Fig. 6.18 
shows the classification performance of the CCA under the same condi- 
tions assumed for Curve 1 except the classifier was trained on the 
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reconstruction from the compressed data. The reconstruction consisted 
simply of using the appropriate cluster mean to approximate each vector. 
Curve 2 demonstrates substantial performance degradation- relative to' 
'Curve 1, The primary reason for this degradation is that the procedure 
for training with original data is not in general appropriate for train- 
ing with compressed data. When training with original data it is 
reasonable to assume that the training sets provide a set of independent 
and identically distributed samples from which the first and second 
order statistics can be estimated. However, this assumption is less 
reasonable when dealing with a compressed representation of the train- 
ing set, since most data compression introduces increased dependence 
between vectors in a local region. The conditional distribution of all 
the samples in the image may be changed very little by the data compres- 
sion, but the conditional distribution determiiled by a small training 
subset may be substantially different. 

5. Performance from a Modified Estimate for Training Set Mean 
and Covariance 

There are several approaches to modifying the training procedure 
to account for the data compression. One could use more training sets, 
or select the same number of training set vectors from more scattered 
locations throughout the image. But both of these procedures may 
considerably complicate an already difficult task of training. Alter- 
nately, it may be possible to analytically deteirraine the impact of the 
data compression on the training for the given classifier. Clearly, a 
more sophisticated classifier for the CCA compressed data would classify 
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clusters on the basis of which conditional distribution the cluster was 
most likely to originate from. However, assume for practical reasons it 
is desired to restrict the classifier to the more simple Gaussian para- 
metric classifier. Then it is important to obtain a good estimate of 
the first and second order statistics from the CCA compressed repre- 
sentation of the training sets. 

Assume a training set of vectors for class w have been compressed 

by the CCA, Let the training set have vectors in N different clusters. 

Let p^(^) be a model distribution function for all the vectors in the 

i-th cluster (whether in the training set, or not), such that the first 

and second order statistics of p . (X) are equal to the sample mean U. and 

sample covariance the vectors in the i-th cluster, 1=1, 2, 

. , N, Then let P. define the ratio of the number of vectors from 

1 

cluster i in the training set to the total number of vectors in the 
training set. A model distribution p(]p for the training set can then 
be expressed as a mixture distribution, or 

N 

P(X) - ^ p^(X) . (6.14) 

1=1 

The first and second order statistics for p(X) can be determined in 
terms of the N cluster means and covariances, and these first and 
second order statistics for p(X) can be used as an estimate of the 
sample mean and sample covariance of the training set. 

The mean 'of p(X) is given by 


E(X) = 



N 

X p(x) dx = y^p,. 

i=l 


j APi<A) dX 
X 


» 
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or 


N 

E® . 

i“l 


( 6 .. 15 ). 


Let U “ E(X) and let [$3 be the covariance of p(X). Then can be 

found in terms of U. and [$. ] by 

—1 i 

[$] = E J^(X - U)(X - U)^j = E(ja*^) - U 


J \ pQ;) d^ - U 


f X p^(X) dX - U 
i=l •'X 

+iy^|- dd' 


i=l 

N 


N 


N 


Yi * Y 


i=l 


i»l 


i=l 



N \ 

r-A‘ ■ 

i»l / 


(6,16) 


Equations (6.15) and (6.16) define an estimate for the mean and 
covariance respectively for the training set in terms of the means and 
covariances of the clusters from the CCA processing of the training set. 
The mean of the training set is of course a simple weighted average of 
the cluster means. For the covariance in (6.16)> observe that the first 
term on the right is the weighted average of the covariance between the 
cluster means. This can be observed more clearly by looking at a given 
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element of the covariance matrix. Let u. be the j-th element of ^ , and 
£ 

let and be the element in the j-th row and k-th colirnm of [$] 
and respectively. Then from (6.16) we have 


N 


N 


N 


N 


^ L vK - E vH E 


'Jk 


i=l 


i=l 


i=»l 


i»l 


(6.17) 


The estimate for the training set covariance consists of the weighted 
average of intracliister covariance j and the weighted average of inter- 
cluster covariance between the cluster means. 

Now recall that the usual training procedure simply calculates the 
mean and< covariance of the training set vectors according to (6.11) and 
(6.12). If the training set is compressed by the CCA, then each vector 
of the training set is approximated by the mean of the cluster to which 
the vector was assigned. Thus if one uses the tjsual training on the 
CCA compressed training set, the estimate for the mean is the same as 
the estimate in (6.15), but the estimate for the covariance consists 
of only the Intercluster covariance portion of (6.17). Of course, this 
is equivalent to assuming a zero value for the intracluster variance of 
all the clusters. The classification performance corresponding to this 
training was presented earlier in Curve 2 of Fig. 6.18. 

The impact of assiiming a zero average intraclxister variance can be 
large, especially with respect to estimating the diagonal terms of the 
covariance matrix. Experimental observations confirmed that the intra- 
clxister covariance in (6.17) was frequently much larger than the inter- 
cluster covariance. As discussed in [28, p. 54l], clxiStering basically 
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Fig. 6.18. Classification Performance of 
the Uncoded CCA, n = 256, 
f = 32, m =* 2, 4, and 8. 

atten5)ts to minimize the ratio of intracluster to intercluster dis- 
tances for a given set of elements. However, one must note that a 
compresse d tra ining set may often consist of elements from clusters of 
more than one local source. Since the vectors from different local 
sources are not jointly clustered, the intercluster distance between 
clusters from different local sources may be very small. Therefore, it 
is not unreasonable to obtain an intracluster covariance greater than 
the intercluster covariance. The classification performance would 
likely be most seriously degraded when the inter cluster covariance is 
very small. Then the assumption of zero intracluster variance can 
substantially change the form of the conditional distribution function. 

Now consider how the training procedure can be modified when the 
training set is compressed by the CCA. Obviously, one procedure could 
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be to include the cluster covariance as a cluster feature transmitted 
by the CCA. Then the training set covariance estimate could be calcu- 
lated directly from (6,17). However, this procedure is likely more than 
necessary. In particular, one might anticipate that the intracluster 
covariance term in (6.17) is mainly important for the diagonal terms of 
the covariance for the training set. Perhaps the crudest modification 
of the training would be to simply assume a constant value 3^0 for 
the diagonal terms of the intracltister covariance for all clusters, and 
assume a value of zero for all off diagonal elements. This corresponds 
to assuming a sjrmmetrical distribution of variance B in each band for 
each cltister. The covariance matrix for the training set could be 
obtained from the Intercluster variance portion of (6.17), or from the 
usual training procedure, but in either case a value of B is added to 
.the diagonal elements. Curve 3 in Fig. 6.18 shows the classification 
performance when the covariance matrix for the compressed training set 
is modified by adding B = 2 to the diagonal elements. A significant 
performance increase is observable relative to the usual training pro- 
cedure which corresponds to using B = 0, and which is shown in Curve 2. 
The performance of Curve 3 was quite insensitive to changes in B 
between 1 through 5. This result demonstrates that even a very simple 
adjustment in the training procedure can make a substantial difference 
in the classification performance. 

An additional simtilation was conducted to determine how much the 
performance would improve if the CCA did transmit covariance information 
about the clusters. Only the CCA configuration of 16 16 local 

sources with 4 clusters each was tested. In addition to the cluster 
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mean, the CCA also transmitted the cluster variance per band quantized 
to 3 bits. This extra information increases the data rate of the CCA 
by only 0.0469 bpppb. The covariance matrix for each training set was- 
calculated from (6.17) by using the transmitted variance per band for 
the diagonal elements of the intracluster term, and zero for the off 
diagonal terms. The classification perfoirmance of this simulation was 
not significantly different than the corresponding performance from 
using 3=2 for each variance per band. One could continue attempting 
to improve the classification performance. For example, the classifier 
could be defined in terms of discriminant functions which result from 
modeling the conditional distributions as a weighted mixture of Gaussian 
distributions, one for each cluster. These alternate classifiers are 
not explored in this work. 

6. Discussion of the Data Compression and Data Interpretation 
Interaction » 

The relative changes in the performance curves in Fig. 6.18 indi- 
cate the importance of considering the interaction between data com- 
pression and data classification. If the classification and classifier 
training are not properly matched to the compressed data, an erroneous 
measure of classification performance is likely to be obtained. A 
similar situation exists with regard to assessing the impact of data 
compression on other data processing techniques, such as clustering, or 
geometric correction of image data. For example, in assessing the 
inpact of CCA data compression on clustering analysis, one should not 
sinply cluster the reconstructed image elements. Such a procedure again 
assumes a zero intracluster covariance for every cluster in the 
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compressed data, which can significantly degrade, or change the 
nonsupervised classification (clustering results) in much the same 
manner as occurred for supervised classification. The clustering may 
first need to be modified in order to reflect the fact that it is 
clustering clusters of nonzero covariance instead of independent 
vectors . 

7. Performance vs. Class Density per Local Source 

From Curve 3 in Fig. 6.18 the classification performance of the 
CCA is observed to be very good. In particular, the CCA processing 
enabled the Gaussian parametric classifier to reduce the average 
error rate while achieving substantial compression of the data. 

However, the absolute level of performance of the CCA will be dependent 
on the ntmaber of classes occurring within a local source. As the 
average number of classes per local source increase, the classification 
performance of the CCA will decrease. Further simulations are required 
in order to quantify the classification performance as a function of 
the number of classes per local source. One of the difficulties in 
conducting such simulations is the unavailability of test images which 
simultaneously have a higher number of classes per 16 x 16 element 
regions and reliable ground truth through the class boundaries. Some 
understanding of the impact of more classes per local source can be 
obtained from Fig. 6.15. 

As previously noted. Fig, 6.15(a) is the original image used for 
classification performance testing, and Fig, 6.15(b) is the grey-level 
coded classified image resulting from training and classifying on the 
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original data. Figure 6.15(c) shows a reconstructed image resulting 
from using the CCA to congress the original image to 4 clusters per 
16 X 16 element local source. In Fig. 6.15(d) is shown the grey-level 
coded' clas'si'fied image resulting from training and classifying on the 
CCA compressed image of Fig. 6.15(c). The training on the con5)ressed 
image data was modified as previously discussed by adding 6 = 2 to the 
diagonal elements of the covariance matrix for each class. Thus this 
classified image corresponds to one of the simulation results used in 
Curve 3 of Fig. 6.18. Figure 6.16 is a color composite image repre- 
sentation of the same results shown in Fig. 6.15, and the legend for 
associating classes and grey-levels, or colors is shown in Fig. 6.17. 
From Figs. 6.15 and 6.16 one can observe that even though the CCA was 
limited to only 4 clusters per 16 x 16 elements, the field boundaries 
and interconnecting regions with the higher number of classes per local 
source are apparently well classified. Thus the CCA classification per- 
formance appears not to be overly sensitive to increases in the number 
of classes per local source. Of course, the Adaptive CCA could ideally 
tailor the number of clusters per local source dependent on the number 
of classes present, which should result in further substantial classi- 
fication performance gains. 

8. Performance in Terms of Inventory Accuracy vs. Data Rate 

Another measure of classification performance is inventory accu- 
racy vs. total data rate. Inventory accuracy refers to the accuracy of 
the classifier in determining the relative density of the classes. In 
Table IX the true inventory data is given for the test set of 
Table VIII. Also shown in Table IX is the inventory results obtained 
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TABLE IX 

INVENTORY CLASSIFICATION PERFORMANCE 


Estimate From 

Estimate From CCA Compressed 



True 

Inventory 

Original Data 

Data 

Class 

Number 

Qt 

/q 

Number 

% 

Number 

% 

1 ‘ 

415 

2.28 

513 

2.82 

416 

2.285 

2 

6014 

33.04 

5548 

30.48 

5712 

31.38 

3 

3440 

18.90 

3486 

19.15 

3569 

19.06 

4 

512 

2.81 

619 

3,40 

598- 

3.29 

5 

604 

3.32 

856 

4.70 

785 

4,31 

6 

2398 

13.17 

2454 

13.48 

2394 

13.15 

7 

2000 

10,99 

1917 

10.53 

2010 

11.04 

8 

2820 

15.49 

2806 

15.42 

2819 

15.49 

9 

0 


4 


0 





Total = 18, 

203 




from using the Gaxissian parametric classifier with training and classi- 
fying on the original image in Fig, 6.15(a). The total data rate for 
these- inventory estimates is 8 bpppb. Table IX also shows the inventory 
estimates obtained from training and classifying on the CCA compressed 
data corresponding to the image in Fig, 6.15(c). Again the CCA used 
4 clusters per 16 16 element local source, and the classifier modi- 

fied the training covariance matrices by adding 3 ** 2 to the diagonal 
elements. The Inventory estimates are more accurate for every class, 
while the total data rate for the Uncoded CCA and the Coded CCA are 
.625 and .429 bpppb respectively. 


9. Summary Discussion of Classification Performance 
The classification performance of the CCA, of course, depends on 
the choice of classification technique. In the proceeding discussions 
the Gaussian parametric classifier was used to demonstrate the interac- 
tion between data compression and data classification. In particular. 
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the classification accuracy was observed to be sensitive to the train- 
ing procedure, and it was noted that modifications of the training may 
be appropriate when training on compressed data. The CCA data, compres- 
sion was shown to be capable of benefiting the classifier by providing 
group information on the data; however, further investigation is 
required to determine the conditions for which the group information 
degrades the classification accuracy. The previous classification per- 
formance simulations showed the CCA compression to result in signifi- 
cantly lower average vector classification error and lower inventory 
error, while substantially reducing the total data rate. It should 
also be noted that in addition to reducing the data volume, the CCA 
data compression also results in reducing the number of classifications 
required. For example, in the CCA simulations where 4 clusters are 
used per 16 x 16 element local source, the number of classifications is 
reduced by a factor of 64. 
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CHAPTER 'VII 

SUMMARY AM) RECOMMEM)ATIONS FOR FURTHER RESEARCH 

In this dissertation a joint clustering and data compression 
concept is defined for efficient transmission of image information 
which is to be obtained from manual photo interpretation and/or com- 
puter classification. Clustering is used to extract features which 
tend to preserve class separability information within the data. Data 
compression is used to provide efficient transmission of the informa- 
tion in terms of the clustering features. Various forms of a Cluster 
Compression Algorithm are defined, discussed, and extensively simulated 
to investigate application of this concept to muitispectral image data. 

The following is a list of topics for further research; 

1) The Cascaded CAA should be further investigated by incorpo- 
rating a better intercluster distance measure within the 
cascaded clustering. 

2) An appropriate test set should be obtained, or constructed 
for testing classification performance as a function of class 
density. In addition, a deteinnination of typical class den- 
sities for Intended applications is needed. 

3) Other classifiers, such as mixture distribution parametric 
classifiers, might be assessed for their impact on CCA classi- 
fication performance. 

4) Alternate feature map encoding and spatial feature extractiou 
could be explored. For example, boundary finding algorithms 
might be implemented more simply on the feature map than on 
the original data. 
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5) The use of transformations (e.g., the Karhunen-Loeve, or 

variance normalization transformations) might be investigated 
for use in preceding the CCA. 

-6)- - -The concept o'F joint clustering and data compression could 
also be researched for applications to television data and 
artificial intelligence. 
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